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ABSTRACT 

Information  may  be  stored  in  the  memory  or  the  memory  exten- 
sions of  digital  computers.   The  ease  with  which  this  information 
may  be  manipulated  and  retrieved  is  greatly  complicated  if  it  is 
of  the  non -numeric  type.   However,  techniques  are  known  which 
facilitate  the  solutions  to  these  information  storage  and  retrieval 
problems.   These  techniques  allow  the  optimization  of  efficiency 
according  to  one  or  more  criteria.   Various  list  languages  are 
introduced  and  the  overall  information  storage  and  retrieval 
problem  is  outlined.   Four  basic  storage  allocation  systems  are 
discussed  and  their  relative  merits  compared.   Tree  structures 
embodying  the  best  features  of  the  four  basic  systems  are  described 
and  evaluated „   Efficiency  computations  from  the  standpoint  of 
search  time  and  the  search-time,  storage-space  product  are.  pre- 
sented.  A  specific  tree  structure  called  a  trie  is  introduced 
as  a  compromise  between  the  two  extremes  of  the  balanced  tree  and 
the  fully  elided  unbalanced  tree.   Finally,  a  technique  called 
Fibonaccian  searching  is  presented  as  a  method  to  conserve  both 
storage  space  and  time. 
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1.    Introduction. 

It  is  the  purpose  of  this  paper  to  bring  together  in  a  single 
document  a  description  of  some  of  the  schemes  which  have  been  pro- 
posed in  the  field  of  Information  Storage  and  Retrieval.   Further, 
it  is  intended  that  the  emphasis  be  on  the  efficiencies  of  the 
various  schemes  and  how  these  efficiencies  are  modified  by  inter- 
action during  combination. 

Automated  information  storage  and  retrieval  has  for  some  time 
been  a  dream  of  scientists  and  engineers „   A  most  provocative  article 
was  written  by  Vannevar  Bush  just  after  World  War  II  describing  the 
"library  problem"  and  a  proposed  solution.  Tl^     Most  of  what  was 
proposed  could  then  only  be  classified  as  speculation  but  we  have 
seen  that  this  speculation  was  well  based.   His  proposed  private  file 
and  library,  which  he  named  the  "memex",  is  within  the  capabilities 
of  our  technology  today.   The  bar  to  its  realization  is  the.  com- 
plexity of  the  indexing  and  manipulation  of  non-numerical  information. 
Much  work  has  been  done  on  the  problem,  but  we  are  still  a  long  way 
from  a  completely  automated  system.   It  is  not  suggested  that  what 
follows  here  is  an  exhaustive  compilation  of  the  state-of-the-art. 
It  is  rather  a  sampling  to  indicate  the  current  status  of  some  of 
our  attempts  at  completely  mechanized  information  storage  and  retrieval 

It  is  intended  that  Section  2  will  provide  an  introduction  to 
a  widely  used  device  for  manipulation  of  non-numeric  information, 
i.e.  the  list.   Also  it  briefly  describes  several  computer  languages 
which  use  lists.   Section  3  describes  the  information  storage  and 
retrieval  problems  of  1)  automatic  determination  of  document 


content,  2)  organization  of  structured  information  within  a 
computer,  and  3)  evaluation  of  systems  „   Specific  file,  processing 
operations  are  discussed  in  Section  4  in  order  to  provide  a  common 
basis  for  evaluation  of  the  storage  allocations  described  in 
Sections  5  and  6.   Sections  5  and  6  contain  programs  which  describe, 
the  searches  and  are  intended  to  convey  a  feeling  for  the  type  and 
amount  of  processing  required.   Section  7  shows  how  it  is  possible 
to  design  trees  to  minimize  the  time  necessary  to  search  them  while 
at  the  same  time  holding  the  tree's  storage  requirements  to  a 
minimum.   A  comparison  is  made  in  Section  9  between  the  storage 
requirements  of  trees  representing  both  extreme  and  mean  types. 
Finally,  an  efficient  method  of  searching  an  ordered  list  in  a 
machine  not  capable,  of  a  binary  shift  is  given  in  Section  9. 


2.    Hueristic  programming  and  the  development  of  list  languages. 

The  application  of  digital  computers  has  been  most  successful 
thus  far  in  areas  where  the  manipulation  of  numeric  quantities  is 
the  primary  operation.   More  recently,  attempts  are  being  made  to 
handle  more  complex  and  ill  structured  problems  such  as  theorem 
proving,  playing  chess  and  checkers,  or  performing  various  symbolic 
calculations  like  differentiation  and  integration.   The  motivations 
behind  this  work  range  from  a  desire  to  extend  the  capabilities  of 
computers  to  the  desire  to  understand  how  humans  think,  learn,  and 
solve  problems.   ^24] 

In  a  non-numeric  problem  the  contents  of  words  in  memory  are  not 
treated  as  though  they  contained  numbers  (although  in  fact  they  do). 
Rather,  the  contents  are  treated  as  alphanumeric  strings  of  characters 
or  symbols.   For  example,  they  might  be  algebraic  expressions,  coded 
information,  or  English  words.   For  this  reason  the  term  symbol 
manipulation  is  sometimes  used  to  characterize  the  procedures 
used.  £35j 

A  fairly  common  characteristic  of  a  nonnumerical  problem  is 
that  there  is  no  algorithm  or  standard  procedure  for  solving  it. 
More  precisely,  the  algorithms  needed  for  solution  are  very  complex 

if  they  are  known  to  exist  at  all.   In  addition,  there  is  a 

need  for  a  unit  of  data  larger  than  a  single  number,  and  the  data 
must  be  able  to  change  in  both  structure  and  content  as  the  problem 
is  solved. 


A  solution  to  this  problem  has  been  found  in  the  list.   A  list 

may  be  defined  as  an  ordered  set  of  tokens,  connected  in  a  structure 

such  that  at  least  one  successor  to  each  token  can  be  identified. 

[_35j    In  some  list  structures,  predecessors  of  tokens  can  be 

similarly  identified,   A  simple  example  of  a  list  is  an  ordered  set 

of  integers.   The  successor  of  one  integer  is  the  "next  larger" 

integer.   A  more  complex  example  is  illustrated  in  Figure  1.   In 

this  example  (taken  from   C^D   )  the  address  of  each  word  in  the 

list  is  stored  in  the  word  that  precedes  it  in  the  list.   Each  word 

in  the  list  may  contain  two  addresses;  one  is  the  address  of  the 

item  that  belongs  at  that  point,  and  the  other  is  the  address  of  the 

next  word  in  the  list.   The  latter  address  is  the  pointer  address. 

The  numbers  outside  the  boxes  are  location  addresses  for  the  boxes. 

The  list  tokens  indicated  in  this  example  comprise  the  symbolic 

expression: 

(NAME -LIST/WORD)+ABC 

Note  that  the  order  of  the  tokens  is  given  by  the  pointer  addresses; 

the  tokens  need  not  be  listed  in  memory  in  sequence.   The  pointer 

address  00000  signals  the  end  of  the  list. 

With  a  list  structure,  the  insertion  or  deletion  of  tokens  is 

relatively  simple.   For  example,  to  insert  the  subexpression  +DIGIT 

immediately  after  NAME  in  the  expression  in  Figure  1,  the  pointers 

are  modified  as  shown  in  Figure  2.   Two  words  were  added  to  the 

list;  only  one  pointer  already  in  the  list  was  changed  (from  02205 

to  02221),  and  everything  else  in  the  list  was  unchanged.   Two 

tokens  were  added  to  the  block  at  04000.   Deletion  of  tokens  is  also 
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simple;  pointers  are  modified,  and  nothing  else  is  done.   There  will 
be  extraneous  information  in  words  that  were  once  part  of  the  list, 
but  since  pointers  bypass  these  words,  no  reference  is  made  to  them. 

As  lists  are  processed,  many  are  replaced  by  others  or  are 
for  some  reason  of  no  further  use.   If  the  space  taken  by  such  lists 
is  never  again  used,  the  available,  space  within  the  computer  may 
eventually  be  depleted.   The  words  no  longer  needed  are  scattered 
all  over  memory,  and  a  means  is  required  for  determining  where,  they 
are.   This  is  accomplished  by  forming  a  list  of  available  space  (LAS) 
In  this  list  each  word  points  to  a  successor  and  to  an  available 
word.   This  list  has  the  same  structure  as  the  list  described.   If 
the  scheme  is  to  function  properly,  each  list  operation  must  supply 
to  the  LAS  those  words  made  free  by  the  operation.   Thus  at  any  time 
all  free  space  is  linked  into  one  list.   The  programmer  is  free  of 
worry  over  the  space. 

Several  languages  have  been  developed  during  the  past  few 
years  which  use  lists  and  list  structures  as  a  fundamental  working 
unit.   A  brief  review  of  this  work  illustrates  some  of  the  applica- 
tions of  list  processing  languages  and  the  various  ways  in  which 
list  processing  can  be  introduced  into  programming. 

The  work  of  Newell,  Shaw,  and  Simon  began  in  late  1954,  trig- 
gered by  the  pioneering  work  of  Selfridge  and  Dineen  on  a  program 
for  recognizing  visual  patterns.  £4^   JJ34]   They  first  worked 
on  chess  and  then  switched  to  the  task  of  proving  theorems  in  the 
propositional  calculus  of  Whitehead  and  Russell,  L26J   They 
originally  devised  some  languages  that  were  tied  closely  to  subject 


matter a  "chess  language"  and  a  "logic  language".   These  languages 

collectively  called  IPL-I  (Information  Processing  Language-I) , 
although  designed  as  pseudo-codes,  never  reached  the  coded  stage 
(see  £25j  for  description).  [24]  [39]   £27j   The  I PL  series, 
I  through  VI,  was  developed  for  use  with  symbolic  logic  programs  and 
hueristic  programs  such  as  chess  programs,  programs  for  use  in 
management  (e.g.  assigning  tasks  to  work  stations  for  an  assembly 
line)  and  a  program  called  GPS,  for  General  Problem  Solver,  which 
was  an  effort  to  simulate  human  behavior. 

H.  Gelernter  has  developed  a  program  for  proving  theorems  in 
plane  geometry.  [_9J   LllJ     To  do  this  he  and  his  colleagues 
developed  a  list  processing  adjunct  to  FORTRAN  for  the  704  called 
FLPL  for  FORTRAN  List  Processing  Language,   [[loj   This  was  done 
by  adding  a  series  of  subroutines  to  the  FORTRAN  system.   FLPL  uses 
lists  only  for  data,  since  it  uses  FORTRAN  produced  machine  code 
for  routines. 

J.  McCarthy  has  developed  another  list  language  for  the  704 
called  LISP,  for  List  Processor,  [llj        This  one,  like  IPL-V, 
uses  lists  for  both  routines  and  data.   Externally,  LISP  uses  a 
horizontal  notation  in  which  list  structures  are  represented  with 
the  aid  of  parentheses.   The  identity  of  parenthetical  notation 
and  list  structures  can  be  seen  from  the  following  figure: 
(A,B,  (C,D)) 


Part  of  the  reason  for  the  development  of  LISP  is  McCarthy's  work 
with  Mo  Minsky  on  heuristic  programs.   There  has  also  been  some 
work  done  on  analytic  differentiation  and  integration  of  elementary 
functions  in  LISP. 

Another  list  processing  language  which  was  designed  to  retain 
the  advantages  of  the  preceding  languages  while  simplifying  machine 
processing  of  lists  is  called  a  Threaded  List.   It  was  developed 
by  A.  J.  Per lis  and  Charles  Thornton.   [2<0   A  threaded  list  is 
a  list  structure  in  which  the  last  element  of  each  list  specifies 
the  location  of  the  head  of  the  list  of  which  it  is  the  terminal 
member.   The  advantage  of  this  structure  is  that  it  permits  the  de- 
finition of  various  modes  for  sequencing  through  lists  without 
requiring  the  use  of  the  usual  push-down  lists  for  retaining 
sequencing  information.   This  corresponds,  in  the.  representation 
of  programs  by  list  structures,  to  the  representation  of  all  com- 
putable functions  iteratively  rather  than  recursively. 

Weizenbaum  developed  the  KLS  (Knotted  List  Structure)  language 
while  considering  the  relationship  between  computer  languages  and 
computer  organizations,  and  the  hardware -software  tradeoffs  which 
this  relationship  makes  possible „  ^39]    Later  he  developed  SLIP 
(Symetric  List  Processor)  which  is  a  descendent  of  FLPL3  IFL-V, 
Threaded  Lists,  and  his  own  KLS.   In  this  system  each  list  cell 
carries  both  a  forward  and  a  backward  link  as  well  as  a  datum 
(token  of  information).   It  is  symetric  in  the  sense  that  its  lists 
do  not  have  a  preferred  orientation.   Any  operation  which  can  be 
carried  out  on  the  top  of  a  list  can  just  as  easily  be  carried  out 


on  the  bottom.   It  is  a  language  system  designed  to  be  imbedded  in 
a  higher  order  language  capable  of  calling  machine  language  sub- 
routines (like  FLPL).   [40j 

The  Philco  Company  has  implemented  the  IPL-V  language  on  their 
2000  computer  as  a  set  of  macro-operations,  subroutines  and  con- 
ventions supplementing  TAC  (Translator -Assembler -Compiler ,  the 
assambly  language  for  the  2000).   These  macros,  subroutines,  and 
conventions  will  be  referred  to  as  TALL  (TAC  List  Language).   TALL 
uses  the  loading  facilities  of  TAC,  the  IPL-V  primitive  processes, 
and  a  set  of  subroutines  performing  the  work  of  the  interpreter. 
The  macros  aid  in  the  translation  from  IPL-V  to  TAC.   The  macros 
and  the  primitive  processes  can  be  placed  on  the  TAC  subroutine 
library  tape  and  called  in  as  required  during  assembly. 

The  implementation  of  IPL-V  in  this  fashion  has  several  ad- 
vantages s   (1)  the  time  required  to  get  a  basis  IPL-V  system  running 
on  the  2000  is  only  three  man -weeks;  (2)  symbolic  machine  language 
instructions  can  easily  be  inserted  into  TALL  programs;  (3)  IPL-V 
statements  can  be  used  in  conjunction  with  FORTRAN  statements  or 
JOVIAL  statements;  and  (4)  no  additional  work  is  required  to  make 
TALL  compatible  with  any  monitor  system  for  the  2000.  [[-5 J 

Compilers  for  one  or  more  list  processing  languages  exist 
today  for  every  major  general  purpose  computer. 

In  recognition  of  the  fact  that  not  all  problems  are  purely 
of  the  symbol  manipulation  type,  it  might  be  well  to  mention  a 
technique  for  symbol  manipulation  and  numerical  calculation.   Ross 
has  developed  a  type  of  generic  structure  which  he  calls  a  plex,  [31J 
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There  exist  problems  which  are  purely  numerical ,  thereby  net  requir- 
ing symbol  manipulation  techniques,  and  there  also  exist  problems 
which  are  purely  symbolic,  thereby  not  requiring  efficient  handling 
of  data  for  computation.   But  there  is  a  much  larger  class  of 
problems  in  which  both  symbolic  manipulation  and  numerical  calcula- 
tion are  required  and  are  inextricably  intertwined,,   The  pi  ex 
structure  is  designed  to  handle  this  latter  type.   A  claim  is  made 
that  this  structure  is  more  powerful  than  the  list  structure  (it 
includes  lists  as  a  subcase)  and  appears  to  be  better  suited  for 
the  concise  representation  of  the  complex  interrelations  of  elements 
which  constitute  a  "problem".   C3lT]   Note:   The  word  plex  is  an 
abbreviation  of  the  word  plexus :   "An  interwoven  combination  of 
parts  of  a  structure;  a  network."  ----Webster. 
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3.    Information  storage  and  retrieval. 

One  important  subset  of  the  set  of  non-numeric  applications  of 
the  digital  computer  is  information  storage  and  retrieval.   The  most 
recent  and  profitable  improvements  in  the  capabilities  of  digital 
computers  have  been  the  result  of  software  innovations,  i.  e. 
assembly  programs,  subroutines,  compilers  or  translators,  etc. 
However,  as  the  result  of  work  in  the  symbol  manipulation  area, 
certain  hardware  specifications  have  been  shown  desireable.   These 
may  be  classified  into  two  main  groups,  1)  those  which  apply  to  the 
modern  random  access  magnetic  core  type  memory,  and  2)  those  which 
apply  to  the  new  associative  type  memories.   Typical  of  the  refer- 
ences of  the  former  are   [4lJ   and  £23j  ,  and  of  the  latter 

£3(f]   and  £28^   .   It  is  not  the  intent  of  this  paper  to  become 
involved  in  detailed  hardware  comparisons.   On  the  other  hand,  most 
of  the  discussion  that  follows  is  concerned  with  software  solutions 
to  problems  related  to  storage  and  manipulation  of  structured  in- 
formation within  the  magnetic  core,  random  access,  type  computers 
(the  most  common  in-use  general  purpose  computers).   A  word  of  warning 
to  the  reader  is  appropriate  at  this  point.   The  literature  on  this 
subject  often  uses  the  term  "associative  memory"  without  distinc- 
tion between  the  true  associative  memories  and  associative  memories 
implemented  in  present  day  addressable  memories  by  simulation 
techniques,  e.  g.  see  [.171] 

Most  of  what  has  been  written  regarding  information  storage 
and  retrieval  starts  with  the  very  convenient  assumption  that  the 
information  has  previously  been  formatted,  keyed,  and  stored.   L-^J 
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This  does  not  mean,  however,  that  the  subject  of  automatic  determina- 
tion of  content  of  documents  has  been  completely  neglected.   There 
have  been  several  attacks  on  the  problem, 

A  well  known  method  involves  a  frequency  count  of  words  in  the 
text;  those  words  which  occur  most  frequently  are  judged  to  be  the 
most  significant  and  are  used  as  indicators  of  the  content  of  the 
document.   A  promising  variant  to  this  method  has  been  proposed  which 
utilizes  only  the  initial  occurrences  of  nouns  of  the  text  to  detect 
document  content.   The  results  indicate  that  the  proposed  method  may 
be  useful  for  both  the  automatic  indexing  of  documents  and  the  pro- 
duction of  automatic  abstracts.  [_13 J 

A  completely  different  approach  to  automatic  indexing  and 
classification  is  the  use  of  bibliographic  references  provided  with 
the  document  as  quasi -index  terms  to  improve  indexing  methods.   An 
outstanding  advantage  of  this  method  is  that  processing  the  full  text 
of  the  document  is  not  required.   Although  the  formats  of  biblio- 
graphic references  may  vary  in  different  documents,  there  has  been 
some  success  at  using  a  computer  program  to  put  these  various  forms 
into  a  standard  form,  so  that  they  may  then  be  processed  by  another 
program. 

Another  class  of  techniques  to  aid  in  the  automatic  determina- 
tion of  document  content  utilizes  a  syntactic  analysis  of  some  of 
the  sentences  of  the  document.   In  a  retrieval  system  of  this  type 
there  are  numerous  alternatives  available  in  the  degree  and 
thoroughness  of  analysis. 
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The  use  of  the  semantic  values  of  the  words  of  documents  has 
been  used  by  several  researchers  in  the  area  of  data  retrieval. 
There  are  difficulties  in  this  approach  as  might  be  expected  when 
different  authors  naturally  use  different  words  to  express  similar 
ideas.   These  semantic  relations  may  also  be  supported  by  a  syn- 
tactic analysis  as  mentioned  above. 

Another  area  of  information  storage  and  retrieval  which  has 
been  thoroughly  investigated  is  the  organization,  of  structured  in- 
formation in  a  computer  and  the  effects  of  various  allocations 
systems  upon  search  strategies  and  matching  procedures.   Three  basic 
formats  ---  vector,  tree,  and  graph  ---  are  encountered  in  the 
literature  for  the  representation  of  structured  information.   Although 
the  three  forms  are  closely  related,  it  is  often  convenient  to 
consider  them  separately  when  analyzing  characteristics  of  systems 
associated  with  them. 

During  the  past  several  years,  techniques  for  the  evaluation 
of  information  storage  and  retrieval  systems  have  been  developed. 
Swets   has  reviewed  ten  different  measures  for  evaluation  and 
proposes  another  based  on  statistical  decision  theory.    t^SJ 

The  various  measures  reviewed  have,  much  in  common.   Eight  of 
them  evaluate  only  the  effectiveness  (accuracy,  sensitivity,  dis- 
crimination) of  a  retrieval  system  and  are  derived  completely,  in 
one  way  or  another,  from  the  two-by-two  contingency  table  of 
pertinence  and  retrieval  represented  in  Fig.  3.   The  other  three 
measures  assess  efficiency  as  well  as  effectiveness  by  including 
such  performance  factors  as  time,  convenience,  operating  cost, 
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and  product  form.   The  questions  of  effectiveness  and  efficiency 
are  difficult  to  answer  and  care  must  be  taken  in  any  comparative 
analysis.   [38]   [20]   [2l] 
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Figure  3. 
The  Two-by-Two  contingency  table  of  pertinence  and  retrieval „ 
P  and  P  denote,  respectively,  pertinent  and  nonpertinent  items  g 
R  and  R  denote,  respectively,  retrieved  and  unretrieved  items; 
a,  b,  c,  and  d  represent  the  simple  or  weighted  frequencies  of 
occurrence  of  the  four  conjunctions;  V.  is  the  value  of  retrieving 
a  pertinent  item;  V_  is  the  value  of  not  retrieving  a  nonpertinent 
item;  K  is  the  cost  of  retrieving  a  nonpertinent  item;  and  K_  is 
the  cost  of  failing  to  retrieve  a  pertinent  item.   ^38^ 
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4.    File  processing  operations. 

One  of  the  major  considerations  when  deciding  upon  a  particular 
storage  allocation  scheme  for  use  in  an  information  processing 
system  is  its  effectiveness  and/or  efficiency,,   In  order  to  make 
meaningful  comparisons  between  various  schemes  considered,  it.  is 
necessary  to  use  parameters  and  file  processing  operations  which 
are  common  to  them  all,  insofar  as  this  is  possible.   Tverson  has 
developed  a  programming  language  and  a  notation  for  the  description 
of  programs.   £l6j   Part  of  his  notation  is  used  throughout  the 
remainder  of  this  paper.   See  Table  1  for  a  summary  of  parameters 
and  program  symbols. 

Consider  a  system  where  the.  major  subdivision  of  information 
is  a  file.   Let  each  file  be  further  subdivided  into  records,  and 
the  records  into  items.   In  order  to  realize  the  ability  to  mani- 
pulate the  files  as  desired,  some  means  must  be  established  to 
recognize  units  of  information  down  through  the  smallest,  i.  e.  the 
item.   Each  item,  therefore,  must  have  associated  with  It  a  tag  @r 
key,  or  descriptor  which  is  unique.   In  fact,  it  is  useful  to  define 
the  item  as  having  two  parts,  1)  the.  key  which  is  that  part,  which 
distinguishes  the  item  from  all  other  items,  and  2)  the  function 
which  is  that  part  which  is  not  the  key.  \j>l~} 

Each  key  corresponding  to  an  item  of  the  file  is  assumed  to 
be  an  h-tuple,  viz., 

s_  £  S  X  S  X  . . .  X  S      where  s_  =  an  argument  key 
12  h  S,=  ith  set 
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and  each  of  the  sets,  S,  ,  is  considered  to  have  m  mebers  or 

1   1 

characters.   Thus  there  are  m  distinct  keys  possible,  e.  g.  if  we 

restrict  the  keys  to  any  six  alphanumeric  s  (a  6 -tuple)  t  the  a ; 

h  =  6 

m  =  36  =  26  letters  +  10  digits 

mh=  366  2  2  X  109 

The  file  is  assumed  to  contain  n  items,  all.  distinct.   All 

items  have  the  same  probability  of  being  selected. 

h  , 

If  a  parameter  d  is  chosen  so  that  d  m   n,  then  d/m  is  a  measure 

of  the  frequency  of  occurrence  of  each  character. 

The  physical  memory  in  which  the  file  is  stored  is  assumed  to 

contain  2  words,  so  that  "a"  bits  are  required  to  address  any  word 

uniquely.   The  representation  of  each  character  is  assumed  to  take 

<   b 
b  bits,  where  m  —  2  .   Thus  hb  bits  are  required  to  represent  an 

h-tuple  £.   Therefore,  in  our  example,  b  ~   6  and  hb  =  36. 

Two  types  of  criteria  are  used  to  compare  the  storage  alloca- 
tion methods?  the  storage  capacity  required  and  the  time  required 
to  perform  a  particular  file -processing  operation. 

Three  measures  of  storage  capacity  are  useds   1)  the  Bumber, 
w  ,  of  bits  per  memory  word  required  to  represent  both  an  item  and 
any  allocation  information  associated  with  that  item  (w  ■  number  of 
bits  per  memory  word  if  the  word  length  is  variable,  axid   equals 
some  integral  multiple  of  a  standard  word  if  the  word  length  is 
fixed) ,  2)  the  total  number  of  words  necessary  to  represent  the 
file  in  a  particular  allocation,  and  3)  the  total  number  of  bits 
necessary. 
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It  is  convenient  to  normalize  the  latter  two  measures.    [36j 

r(words)  = 

total  number  of  words  to  represent  the  file  and  associated  allocation  data 
number  of  words  to  represent  the.  file 

r(bits)  = 

total  number  of  bits  to  represent  the  file  and  associated  allocation  data 
number  of  bits  to  represent  the  file 

By  the  assumptions  of  item  and  file  size  the  denominators  of  r (words) 

and  r(bits)  are  n  and  nb(h  +  e) ,  respectively.   However,  computations 

are  simplified  if  nbh  is  used  as  the  number  of  bits  to  represent  the 

file. 

In  computing  the  storage  capacities  required,  the  locations 
needed  to  store  the  programs  and  intermediate  operating  data  are 
ignored.   This  is  justifiable  for  the  programs  (q.v.)  for  similar 
searches  in  different  allocations  all  have  approximately  the  same 
number  of  lines  and  use  about  the  same  number  of  parameters.   Moreover, 
the  fraction  of  total  storage  used  for  the  programs  is  much  less  than 
that  used  for  the  file  in  applications  where  these  schemes  are 
necessary. 

Typical  file-processing  operations  with  respect  to  which  time 

comparisons  are  made  are:   ^36] 

search  for  match 

search  for  multiple  match 

search  for  smallest  item 

search  for  item  next  smaller  than  a  given  item 

add  an  item  to  the  file 

delete  an  item  from  the  file 
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Iteration  of  either  "smaller"  search  can  sort  the  file.   Iteration 
of  the  add-item  operation  may  be  construed  as  a  recursive  procedure 
for  constructing  the  allocation.   Thus  if  the  time  to  construct  the 
allocation  is  required,  one  computes 


Z>; 


where  t.  is  the  add-item  time  for  a  file  of  i  items. 

1 

Two  items,  a  and  b  ,  are.  said  to  ma^ch  (a  =  b)   if  and  only  if, 

for  all  i,  a.   is  the  same  character  as  b 
'  —i  —i 

Two  items,  e_  and  b,   are  said  to  match  with  respect  to  n 

(  wrt  u) ,  where  u  is  a  logical  vector  of  dimension  h,  if  and  only  if 

u/a  =  u/b  :   that  is,  a,  =  b.  when  u.   =1  (here  we  are  using  Iverson's 
'   i    i      — i 

notation.   In  particular,  u/a  is  his  "compression"  operation).   [~16 J 
u  is  called  a  mask.   If  the.  mask  u  is  not  explicitly  mentioned,  it 
is  assumed  to  be  a  vector  or  all  l's. 

An  order  relation  is  defined  on  S  as  follows:   For  those  S, 

i 

that  are  ordered,  let  "  ^  "  be  the  relation  "precedes".   Roughly 
speaking,  a  <£  b  ,  if,  when  the  unordered  sets  are  ignored  and 
the  ordered  sets  are  considered  to  be  nonnegative  integers,  the 
"number"  a  is  less  than  the  "number"  b. 

The  restriction  of  the  ordering  with  respect  to  a  mask  u  is 
defined:   a  <  b  wrt  u  if  and  only  if  u/a  <  u/b. 

Those  items  s^  of  file  F  which  are  such  that,  for  all  £  in 
F,  s_  <_t  wrt  u,   are  the  smallest  items  of  the  file  wrt  u. 
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The  items  next  smaller  than  .s.  wrt  ju.  are  those  items  t  in  F 
such  that  t    <  s_  wrt  u  and  for  which  there  is  no  r  in  F  such  that 
t   <  r  <  £  wrt  u„ 

The  definition  of  the  ordering  relation  "  <  "  requires  that 
the  elements  of  the  h-tuple  are  weighted  by  the.  order  in  which  they 
occur  in  the.  h-tuple.   This  may  be  generalized  to  utilize  different 
orderings  as  specified  by  some  permutation  vector  £.   The  restricted 
ordering  with  respect  to  mask  u  may  still  be  used,  so  that 
a  <  b  wrt  u  wrt  £  if  and  only  if  u/a  ^  u/b  wrt  u/p_  (The 
expression  x  <  y_  wrt   z_  is  not  ambiguous  because,  a  given  z_  cannot 
be  both  a  permutation  vector  and  a  mask  vector  except  in  the  trivial 
case  z_   =  (1)).   If  the  permutation  £  is  not  explicitly  mentioned., 
it  is  assumed  to  be  the  identity  permutation. 

A  match  search  of  file  F  for  item  x  wrt  u  is  an  operation 
which  identifies  that  item  s  of  F  such  that  x  =  s_  wrt  u.   To 
qualify  as  a  match  search  u  must  be  so  chosen  that  the  search  identi- 
fies exactly  no  or  one  item  of  the  file.   If  u  is  such  that  several 
items  of  the  file  may  be  identified,  the  search  is  called  a  multiple- 
match  search.   In  general  a  multiple -match  search  is  more  complex 
than  a  match  search  so  that  it  is  assumed  that  the  user  of  the  file 
knows  when  to  expect  a  unique  result  and  when  he  will  be  content 
with  a  single  response  from  many  possibilities. 

As  an  example  illustrating  the  possible,  uses  of  the  suggested 
searches,  consider  the  file  to  be  a  personnel  record  of  a  large 
corporation.  |^36l   In  such  a  file  each  item  would  correspond  to 
one  employee  and  might  be  categorized  as; 
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Name 

Social  security  number 

Department  number 

Job  code 

Salary 

Hence,  each  item  is  a  5-tuple.   Note  that  in  this  example  the  sets 

of  the  file  are : 

S   £2  the  set  of  employees'  names 

S9  fcg    the  set  of  employees'  social  security  numbers 

S   ?s  the  set  of  departments 

S.  &    the  set  of  job  codes 
4 

S_  $2    the  set  of  salaries 

We  have  obviously  departed  from  our  original  example  where  the  sets 

S.  were  the  set  of  alphanumeric  characters.   This  is  done  to  make 
l 

the  explanation  of  searches  more  easily  seen. 

Probably  the  most  common  search  of  such  a  file  would  be  a 
match  search  to  locate,  the  item  for  a  particular  individual;  e.g. 
u-  (1,0,0,0,0) 
s_  =  (Smith  John  A, -,-,-,-) 
To  locate  all  the  secretaries  in  department  474 ,  a  mul tiple- 
match  search  would  be  conducted  with 
u  =  (0S0S1S1S0) 
s  =  (-}-,474:,secrys-) 
The  payroll  department  might  desire  an  alphabetical  listing 
of  all  employees  by  departments  and  salary;  that  iss  sort  the  file 
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with  major  category  "salary",  then  "department  number",  then  alpha- 
betically by  "name".   To  do  this,  a  next  larger  search  is  iterated 
with 

u  =  (1,0,1,0,1) 

£=  (3,-,2,-,l) 

is  =  last  item  retrieved 
and 

initial  s  =  (A,-,p,-,0)o 
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5.    Storage  allocation  systems  and  their  relative  merits. 

When  reduced  to  basic  considerations,  all  information  storage 
systems  for  addressable  storage  within  a  digital  computer  memory 
use  one  or  more  applications  of  four  basic  types,  i.  e„  random, 
ordered,  computable,  or  chained „   Each  has  its  own  advantages  and 
disadvantages.,   As  we  shall  see  later,  more,  complex  systems  which 
are  combinations  of  these  basic  systems  may  be  devised  which  allow 
optimization  for  particular  applications. 

In  this  part  the  four  simple  storage  allocation  systems  mentioned 
above  are  examined.   For  each,  the  description  of  the  al location , 
discussion  of  various  searches,  programs  for  the  searches,  and  cal- 
culations of  search  times  and  storage  requirements  are  given.   The 
treatment  here  has  been  adapted  from  Sussenguth  with  minor  modifica- 
tions .   f  36] 

The  programs  delineating  some  of  the  search  descriptions  are 
intended  to  convey  a  feeling  for  the  type  and  amount  of  processing 
required,  rather  than  being  exact  programs  which  cover  all  contin- 
gencies.  Indeed,  the  details  of  initialization  and  special  cases 
(e.g.,  multiple  responses  to  a  smaller  search)  are  frequently 
omitted  or  condensed  to  a  single  program  step.   In  contrast  to  this 
supression  of  detail,  in  the  main  iterative  loop  of  the  program, 
the  step  of  retrieving  an  item  from  memory 

a  -«—  M1 

is  explicitly  shown.   It  is  this  operation  that  is  assumed  to  be 
the  most  time-consuming,  and  it  is  the  number  of  times  that  this 
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step  is  executed  that  is  reflected  in  the  search  times  which  are- 
calculated.   If  the  operations  of  Initializing;,  indexing,  comparing, 
etc.  are  not  carried  out  in  parallel  with  the  retrieval  operation, 
or  their  execution  times  are  not  negligible  with  respect  to  it, 
the  program  will  give  an  idea  of  how  much  the  memory-access  time 
must  be  increased  to  reflect  a  proper  item  time. 

When  a  program  terminates  at  a  normal  exit  (error  exits  are 
marked  with  an  asterisk) ,  the  item  satisfying  the  search  is  contained 
in  the  accumulator  a.   For  those  searches  which  may  be  satisfied 
by  several  items ,  the  line  "process  a"  is  used  to  indicate  that 
the  present  contents  of  the  accumulator  satisfy  the  search;  after 
this  line  has  been  executed ,  the  search  continues  until  it  terminates 
at  an  exit. 

The  programs  are  all  written  in  the  Iverson  language.   Table  I, 
p.  vii   gives  the  legend  which  is  used  for  all  programs. 

The  measure  of  time  required  to  perform  an  operation  is  the 
number  of  items  that  must  be  "handled"  to  achieve  the  desired 
objective.   "Handling  an  item,"  an  intentionally  vague  phrase,  may 
be  construed  as  extracting  an  item  from  memory  and  transferring 
it  to  the  processing  unit  in  which  a  computation  or  decision  is 
performed  based  on  that  item.   In  general,  many  items  will  be 
handled  to  produce  the  file  operation.   By  defining  an  item  time 
in  this  way,  any  explicit  real  time  measure  (i„e.9  number  of 
seconds)  is  eliminated  but,  in  so  doing,  one  must  avoid  the 
impression  that  item  times  for  different  operations  and  different 
allocation  schemes  are  the  same  with  respect  to  real  time.   This 
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is,  of  course,  not  true  in  general.   To  give  some  feeling  for  the 
real  time  measure  of  an  item  time,  simple  skeleton  programs  for  the 
basic  file  operations  will  be  given  for  roost  storage  allocations. 
The  program  steps  may  then  be  weighted  in  terms  of  operating  times 
of  a  particular  computer.   The  search  times  which  are  computed,  then, 
are  just  a  count  of  the  number  of  file  items  which  must  be  retrieved 
from  memory  to  perform  the  desired  task. 

If  the  program  and  intermediate  operating  data  are  stored  in 
the  same  memory  as  the  file,,  the  real  time  value  of  an  item  should 
include  the  number  of  retrievals  of  both  instructions  and  data  re- 
quired to  perform  one  iteration  of  the  program.   If,,  however,  the 
program  and  intermediate  parameters  are  stored  in  an  auxiliary  high- 
speed (relative  to  the  file,  storage)  memory,  it  may  be  safe  to  ignore 
their  effect.   In  this  latter  case  search  times  for  different  alloca- 
tions and  different  operations  may  be  directly  compared. 

The  search  times  and  storage  ratios  computed  for  each  alloca- 
tion are  summarized  in  Table  2,pp  47, 48. There  are  three  entries  for 
each  search;   the  maximum,  minimum,  and  expected  number  of  item 
times  to  complete,  the  search.   Frequently  the  entries  aie  approxi- 
mations of  the  expressions  which  have  been  derived .   When  applica- 
ble, the  u  or  £  for  which  the  entries  are  calculated  are  shown; 
for  most  other  u  and  £  the  search  reduces  to  the  corresponding 
search  of  the.  random  allocation.   Table  3  repeats  the  average  values 
of  Table  2  but  with  typical  values  chosen  for  the.  file  parameters. 
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A.    Random 

The  simplest  storage  allocation  system  is  that.  In  which  the  items 
are  stored  in  a  random  fashion  in  consecutive  memory  locations. 
Thus,  there  is  no  structure  to  this  organization  and  it  probably 
does  not  deserve  being  classified  as  an  allocation  system.   It  is 
important  to  consider  it,  however,  because  it  serves  as  a  basis  of 
comparison  for  other  systems,  the  cost  of  a  particular  allocation 
system  being  considerably  less  than  the  costs  involved  here  if  the 
other  system  is  at  all  efficient.   Moreover ,  many  allocations  are 
designed  to  achieve  maximum  efficiency  for  one  or  two  particular 
operations  and  may  be  considered  to  be  random  structures  with  respect 
to  other  operations.   An  important  example  of  this  is  the  smaller 
searches  performed  with  respect  to  a  permutation  other  than  the 
identity  permutation. 

Searching  for  a  match  in  the  random  structure  consists  in  exam- 
ining the  items  one  by  one  until  the  desired  item  is  found  (program 
Al) ;  hence,  the  maximum  search  time  occurs  when  the  matching  item 
is  the  last  one  of  the  file  and  thus  requires  n  item  times,  the. 
minimum  time  is  1  when  the  first  item  matches,  and  the  expected 
time  is  n/2. 

For  a  multiple-match,  x  wrt  u,  all  items  must  be  tested 
(program  A2)  as  there  is  no  a  priori  way  of  determining  how  many 
items  satisfy  the  criterion. 

Similarly,  it  is  necessary  to  test  all  items  to  determine  the 
smallest  item  or  the  item  next  smaller  than  x  wrt  u  wrt  £ 
(program  A3  and  A4). 
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Adding  an  item  to  the  file  is  particularly  simple,  for  it  is 
inserted  at  the  end  of  the  file.   Deleting  an  item  is  only  slightly 
more  complicated:   the  item  is  located  and  replaced  by  the  last  item 
on  the  list  (replacement  is  necessary  to  preserve  the  solid  nature 
of  the  storage.)   Thus,  deletion  item  times  are  just  the.  match  search 
times  plus  one. 

As  no  additional  structural  information  is  required  for  this 
allocation  scheme,  r  =  1 „ 

B0    Ordered 

In  an  ordered  file  the  items  have  been  sorted  by  some  external 
means  into  an  increasing  order  with  respect  to  some  u  and  are  stored 
in  consecutive  locations  of  the  memory „ 

A  dictionary  provides  a  simple  example  of  such  a  file.   To 
locate  a  specified  word  the  thumb  index  is  used  to  find  a  location 
near  the  desired  entry,  then  the  catchwords  at  the  top  of  the  pages 
are  used  to  narrow  the  number  of  candidate  items  to  a  few  dozen,  and 
finally  an  entry-by -entry  search  is  made  to  find  the  required  entry „ 
Thus,  in  general,  the  search  for  a  match  in  an  ordered  file  may  be 
conducted  by  successively  finer  scans  of  the  file,  the  first  scan 
determining  a  "general  area,"  the  succeeding  scans  reducing  the 
number  of  candidates  until  an  exhaustive  search  is  performed „ 
Program  Bl  shows  such  a  search  in  which  the  elements  of  the  vector 
c   determine  the  coarseness  of  the  scans . 

Obviously,  the  minimum  search  time  is  1,  wherein  the  first 
item  tested  matches.   The  maximum  number  of  items  which  may  be 
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tested  in  the  first  scan  is   ri/c  |   =  g  ,  the  maximum  on  the 
second  scan    c,  /c_  I   =  g   etc.,  so  that  the  overall  maximum 


search  time  is 


If  n  =  d  and  £  =  (d  ,d  ,..„,1),  then  the  maximum  time  is  hd . 
An  estimate  of  the  expected  search  time  is  found  by  assuming  only 
one  half  the  maximum  number  of  items  are  tested  on  each  scan. 

Another  search  technique  for  the  ordered  files  closely  related 
to  the  "dictionary"  search,  is  the  binary  search  in  which,  at  each 
access  to  the  file,  the  number  of  possible  candidate  items  is 
successively  halved  until  either  the  retrieved  item  matches  or  only 
one  candidate  remains  (program  B2) .   Again  the  minimum  search  time 
is  1„   The  maximum  time  occurs  when  the  last  item  tested  matches; 
this  time  is  llog_(n+l)|   .   To  determine  the  expected  search  time, 
t,  notice  that  on  the  initial  entry  to  the  file  only  one  item  may 
be  tested,  on  the  second  entry  either  of  two  entries  is  tested,  one 
of  four  on  the  third,  and  one  of  2    items  is  tested  on  the  1th 
entry.   Thus, 


t  =  ^     (probability  of  selection  of  ith  item)  X  (number  of 

entries  to  reach  the  ith  item) 


/.  =  ! 
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where  g  =   I  log  (n  +  1)  J  .     If  n  =  28  and  28»  g,  the  expected 
search  time  becomes  g  -  1.   Note  that  this  g  is  not  the  same  factor 
as  used  later  in  the  discussion  of  trees. 

The  search  for  multiple  matches  of  u/x  is  not  difficult  if 
the  mask  vector  is  a  prefix  of  the  vector  on  which  the  items  are 
ordered:   A  match  search  is  conducted  to  find  any  item  satisfying 
u/x;  successive  neighboring  items  "below"  this  item  are  retrieved 
until  an  item  fails  to  satisfy  the  u/x  match;  similarly,  items  "above" 
the  initial  item  are  retrieved  (program  B3) „   The  various  search 
times  are  thus  the  appropriate  match  search  time  plus  the  number  of 
items  satisfying  the  match  plus  two  (two  items  retrieved  do  not 
satisfy  u/x).   If  u  is  not  a  prefix  vector,  the  search  is  essen- 
tially that  of  the  random  structure  although  modifications  may  be 
made  to  account  for  leading  l's  in  u. 

Similarly,  the  search  for  the  smallest  item  and  the  item  next 
smaller  than  a  given  item  become  trivial  if  the  search  is  with 
respect  to  the  u  on  which  the  items  are  ordered.   If  this  is  not 
the  case,  the  file  must  be  considered  just  a  random  listing. 


1 
The  closed  forms  of  all  nontrivial  summations  are  given  in 
the  appendix. 
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To  add  an  item  to  an  ordered  file,  a  match  search  is  performed 
to  determine  its  proper  position  and  then  other  items  are  moved 
to  allow  for  its  insertion.   If  the  location  of  the  top  of  the  file 
is  fixed,  i.  e.,  not  subject  to  modification,  all  of  the  items 
greater  than  the  new  item  must  be  moved  down  to  allow  for  its  inser- 
tion.  Thus,  at  most  n  items  must  be  moved  (new  item  is  the  smallest 
item),  at  least  no  items  must  be  moved  (new  item  is  the  greatest), 
and  an  average  of  n/2  items  must  be  moved.   If  neither  the  top  nor 
the  bottom  of  the  file  is  fixed,  the.  maximum,  number  of  items  which 
must  be  moved  is  n/2  and  the  average  number  of  n/40   Hence,  to 
obtain  approximate  times  for  the  addition,  of  an  item,  the.  respective 
match  search  times  are  added  to  either  n,  n/2,  n/4,  or  0,  as  appro- 
priate.  Deletion  times  are  determined  in  an  analogous  manner. 

As  only  the  file  items  themselves  are.  stored,  w  =  hb  and  r  =  1. 

C.   Computable 

A  storage  allocation  of  a  file  in  which  the  location  of  an  item 
is  a  function  of  a  subset  of  its  elements,  i0  e., 

Location  (s_)  ■  f(u/s) 
is  called  a  computable  allocation.   Frequently,  f  is  a  linear  function 
although  it  generally  need  not  be. 

A  common  example  of  a  computable  allocation  is  the  storage  of 

a  matrix  A  in  which  the  location  of  the  coefficient  A,,  is  a 

-ij 

linear   function  of   i   and   j ; 

location(A.  .)   =    <X  a  -f  /S  |  +  / 
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The  common  practice  is,  of  course,  to  store  only  the  coefficient 
A,.,  not  the  3-tuple  (i,j,A  ),  at  the  designated  location. 

This  practice  of  not  physically  storing  in  memory  those  elements 
of  the  item  used  as  the  arguments  of  f  is  advantageous  in  that  a 
shorter  word  length  may  be  used.   This  gam  may  be  compensated  by 
a  loss  in  efficiency  in  some  search,  however.   For  example,  if 
matrix  A  is  searched  to  find  its  smallest  coefficient,  it  may  be 
difficult  to  determine  the  associated  indices  i  and  j ,  if  they  are 
not  explicitly  stored <>   If  such  searches  are  infrequent,  this  dis- 
advantage is  insignificant,  and  then  it  is  necessary  to  store  only 
the  I/s  portion  of  each  item  s.   Thus,  the  peculiarity  of  r(bits) 
being  less  than  unity  arises,   indeed,  if  5/s  has  z  elements, 

r(blts)  =  nbh  =  h  * 

The  use  of  the  word  "search"  in  a  computable  allocation  seems 
out  of  place,  for  an  item  is  located  by  simply  computing  f(u/s). 
Nevertheless,  the  time  to  compute  f  and  retrieve  the  associated 
item  shall  be  considered  to  be  the  match-search  item  time.   This 
time  has  been  designated  f  rather  than  1  in  Table  2  to  distinguish 
computation  time  from  retrieval  time;  the  relative  magnitude,  of  f 
with  respect  to  1  is  not  restricted „   (In  the  matrix  example,  it 
may  be  necessary  to  retrieve^  £,  and  %     from  the  same  memory  as  the 
items  themselves,  wherein  f  would  be  about  3;  or  o(    (2  f    and  y 
might  be  stored  in  a  high-speed  memory  especially  reserved  for 
such  constants,  wherein  f  might  be  much  less  than  1.) 
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A  multiple -match  with  respect  to  v  involves  computing  f  for 
all  values  v/u/S.   (In  the  matrix  example,  retrieving  all  the  items 
of  a  particular  row  is  a  multiple-match  search.)   The.  multiple- 
match  time  is  a  computation  times,  where  s  is  the  product  of  the 
orders  of  the  sets  v/u/S. 

Searches  for  the  smallest  item  and  the  item  next  smaller  than 
a  given  item  also  involve  one  computation. 

Notice  that  the  preceding  arguments  have  been  based  on  the 
premise  that  the  search  arguments  are  included  in  the  domain  of  f, 
viz.  in  u/S,  and  that  the  elements  of  the  domain  are  all  known.   If 
this  is  not  true,  the  computable  allocations  reduces  to  a  random 
allocation.   Moreover,  the  function  f  may  be  given  arguments  outside 
its  domain  and  an  ill -defined  request  may  result  in  a  legitimate,  item. 
In  other  words,  the  computation  of  the  location  should  be  a  two-step 
process?   It  must  first  be  ascertained  if  the  argument  u/s  is  in 
the  domain  of  f;  then,  if  it  is,  f(u/s)  is  computed.   For  example, 
if  A  is  an  8 -by -8  matrix,  and  the  value  of  A  q  is  requested,  one 
might  obtain  the  value  of  A,    without  any  error  indication 0 

The  addition  or  deletion  of  an  item  to  a  computable  allocation 
is  very  difficult,  because  a  completely  new  f  must  be  constructed  in 
most  cases  (possibly  involving  a  reallocation  of  many  items  also). 
This  may  be  circumvented  when  deleting  an  item  by  retaining  f  but 
inserting  a  dummy  item  in  place  of  the  one  being  deleted.   If  che 
dummy  item  is  retrieved  in  some  search  thereafter,  it  is  ign.-red. 
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D.    Chained 

In  a  chained  allocation  with  each  item  of  the  file  is  associated 
the  addresses  of  other  items  of  the  file.   Thus  when  one  item  ls 
retrieved,  information  is  supplied  as  to  how  to  locate,  other  items 
of  the  file.   Chaining  items  together  in  this  way  eliminates  the 
need  for  storing  the  file  in  consecutive  memory  locations s  greatly 
simplifies  the  problems  of  adding  or  deleting  items,  and  results  in 
reasonable  search  times. 

The  chained  allocation  has  search  times  comparable,  to  the  random 
allocation  if  the  items  chained  together  do  not  reflect  any  structure 
inherent  in  the  file.   If,  however,  items  with  common  characters  are 
chained  together,  the  search  times  can  be  substantially  reduced. 
This  is  accomplished  by  chaining  together  items  with  similar  characters 
in  a  given  character  position.   Thus  if  the  chaining  is  within  set 
S  ,  i.  e.  the  ith  character  position,  and  the  set  has  m  distinct 
elements,  there  will  be  m  chains  each  of  (average)  length  n/m„   An 
example  of  a  chained  allocation  is  shown  in  Fig.  4. 

When  searching  for  a  match,  then,  that  chain  which  agrees  with 
the  appropriate  character  of  the  argument  is  followed  and  each  item 
retrieved  is  tested  for  the  match  condition  by  comparing  it  with 
the  argument  item  (program  Dl),   The  maximum,  minimum,  and  expected 
search  times  are  n/m,  1,  and  n/2m  respectively. 

A  similar  procedure  may  be  followed  for  multiple -match  searches, 
but  the  entire  chain  must  be  traversed  to  insure  all  matching  items 
are  located  (program  D2). 
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For  these  searches  it  is  necessary  that  the  element  of  u 
corresponding  to  the  set  on  which  the  chaining  is  constructed  is 
1  (i.e.  u.  =  1  where  the  chaining  is  within  S„).   To  insure  this, 
it  may  be  necessary  to  have  sets  of  chains  on  several  different 
character  positions  so  that  there  is  at  least  one  chain  which  can 
be  followed  for  all  u  which  may  be  of  interest. 

To  search  for  the  smallest  item,  the  entire  chain  correspond- 
ing to  the  least  character  of  the  most  significant  character  position 
is  traced  (i.  e.  follow  the  chain  corresponding  to  the  least  char- 
acter of  set  S.  where  i  is  such  that  u.  =  1  and  for  all 
i  — l 

j  4   i  £t<  £.).    (Program  D3) 

The  search  for  the  item  next  smaller  than  a  given  item  _s  is 

similar,  but  the  chain  corresponding  to  the  same  character  of  the 

most  significant  character  position  is  followed  (i.e.  follow  the 

chain  corresponding  to  s,  where  i  is  defined  as  above).   If  it  should 

happen  that  £  is  the  least  item  in  this  chain,  the  chain  corresponding 

to  the  character  next  less  than  s.  is  followed  and  the  search  is 

— i 

for  the  largest  element  of  that  chain  (program  D4)„ 

Thus,  the  search  time  for  the  smallest  item  is  always  n/m, 
and  the  time  for  the  next  smaller  item  depends  on  whether  or  not 
it  is  necessary  to  "back  up"  one  character.   As  at  most  one  back-up 
is  required  (because  it  is  the  character  in  the  most  significant 
position  which  is  altered)  and  the  probability  that  it  is  required 
is  m/n  (because  there  are  n/m  items  in  each  chain),  the  maximum, 
minimum,  and  expected  search  times  are  .zLZL   .  Ji-  <j»i«/  (/+  -— J  — - 

respectively. 
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To  find  the  first  item  in  each  chain,  an  auxiliary  file  of 
starting  addresses  is  required.   This  is  a  listing  of  each  character 
of  each  position  which  is  chained  and  the  address  of  the  first  Item 
of  the  corresponding  chain.   Such  files  may  be.  random,  ordered,  or 
computable  allocations  as  may  be  appropriate.   The  time  for  searching 
the  auxiliary  file,  which  has  m  items,  has  been  ignored  in  the  search 
times  of  Table  2. 

Notice  that  the  chained  structure  is  similar  to  the  computable 
structure  in  that  some  characters  of  each  item  need  not  be  stored 
since  their  value  is  implicit  in  the  chaining.   Thus,  if  v  is  the 
logical  vector  with  l's  corresponding  to  the  sets  which  are  chained, 
only  the  elements  v/s  need  be  physically  stored „   No  loss  of  informa- 
tion is  incurred  in  doing  this  if  the  last  item  in  each  chain  is  con- 
nected to  the  auxiliary  file;  thus  the  appropriate  character  value 
may  be  ascertained  whenever  an  item  is  retrieved  via  another  chain. 
Although  eliding  more  than  one  character  position  may  lengthen  the 
search  times  considerably,  r(bits)  may  be  greatly  reduced,  especially 
if  a  K^K.   b.   It  is  assumed  no  elision  has  occurred  in  computing  item 
times. 

To  add  an  item  to  a  chained  file,  the  item  is  written  into  any 
available  location  and  the  chaining  addresses  modified.   A  simple 
way  to  modify  the  chain  addresses  is  to  consider  the  new  item  to  be 
the  first  in  each  chain  so  that  the  starting  list  addresses  point 
to  the  new  item  and  the.  addresses  previously  in  the  starting  list 
are  those  of  the  new  item.   Hence,  if  there  are  c  character  positions 
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which  have  chains,  c+1  items  must  be  handled  to  add  a  new  one 
(c  starting  items  plus  the  new  item). 

To  delete  an  item  it  is  necessary  to  locate  it  by  a  match 
search  and  then  replace  the  chaining  address  of  its  predecessor  by 
the  chaining  address  of  the  item  being  deleted.   The  item  must  be 
located  in  each  of  the  c  chains;;  following  just  one  chain  is  not 
sufficient  for  once  an  item  is  located,  its  successors  in  each 
chain  are  known,  but  not  its  predecessors.   Hence s  c  match -search 
times  are  required  to  delete  an  item. 

In  a  chained  allocation  each  word  must  contain  w  -  hb  +  ca 
bits,  hb  for  the  item  and  ca  for  c  chain  addresses.   Since  charac- 
ters which  are  chained  together  may  be  elided ,  it  is  possible  to 
reduce  the  bit  count  to  (h  -  c)b  +  ca.   In  addition  to  the  n  words 
of  the  file,  cm  words  are  required  to  hold  the  starting  addresses, 
so  that 

r (words)  =  1  +  — —  . 
n 

Similarly  it  is  found  that 

/v_  \    i  >  ca     cm(a  +  b) . 
r(bits)  -  1  +  —   +  -V~"^ 
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E.   Evaluation  of  Results 

Before  attempting  to  compare  the  relative  efficiencies  of  two 
allocation  systems,  one  must  be  certain  that  like  quantities  are  being 
compared „   One  of  the  bases  for  comparison  that  has  been  selected  is 
the  item  time,  which  essentially  reflects  the  number  of  file  items 
that  must  be  processed  to  achieve  the  desired  result.   It  does  not, 
however,  reflect  a  real  time  measure  and,  hence,  the  equality  of  two 
search  times  does  not  imply  equality  of  actual  processing  time. 
Therefore,  before  comparing  allocations  on  the  basis  of  the  data  of 
Table  2,  some  consideration  of  the  real  time  values  of  the  item  times 
must  be  made. 

The  second  basis  of  comparison  chosen  was  the  storage  capacity 
which  the  file,  plus  the  additional  allocation  structure,  required. 
Three  measures  of  storage  capacity  were  computed-   storage  ratios 
r(words)  and  r(bits)  and  the  number  of  bits  per  word,  w.   While  it  is 
mathematically  true  that 

r(bits)  =  r(words)  x  w/bh 

this  equation  may  not  be  true  in  a  practical  sense  depending  on  the 
characteristics  of  the  storage  medium  and  the  data  processor.   Suppose 
the  memory  and  processor  have  a  fixed  word  length  of  N  bits.   If 

N/2  <  w  <  N 
for  a  particular  file  and  application,  then  r(bits)  has  no  real 
significance  because  one  full  word  must  be  used  to  hold  the.  necessary 
allocation  data,  wasting  N  -  w  bits  of  each  word.   Bits  are  similarly 
wasted  if  N  <(  w  and  w  is  not  an  integral  number  of  words.   If, 
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however,  w  =  N/2,  and  the  computer  has  the  ability  to  process  half- 
words  efficiently,  two  allocation  items  may  be  packed  into  one  word. 
Conversely,  if  the  computer  has  a  variable  field-length  feature 
(and  many  computers  do),   r(words)  loses  significance  because  each 
allocation  item  uses  only  as  many  bits  as  necessary. 

Thus,  in  comparing  allocations  with  respect  to  required  storage 
capacity,  the  characteristics  of  a  particular  computer  should  be 
considered  to  determine  whether  r(bits)  or  r(words)  reflects  the. 
storage  efficiency  more  properly. 

It  is  by  no  means  clear  that  operating  time  and  storage  capa- 
city constitute  reasonable  measures.   Indeed  it  is  easy  to  improve 
one  at  the  expense  of  the.  other.   For  example,  a  random  allocation 
requires  essentially  no  maintenance,  has  both  r(words)  and  r(bits) 
as  small  as  possible,  but  has  very  long  search  times;;  at  the  other 
extreme,  duplicating  the  file  in  all  conceivably  useful  configura- 
tions is  extremely  wasteful  of  storage  space  but  permits  rapid  search- 
ing.  One  function  which  overcomes  this  type  of  difficulty  and  has  a 
reasonable  physical  interpretation  is  the  product  of  search  time 
and  storage  capacity.   The  product  may  be  thought  of  as  the  cost  of 
operating  the  system,  for  the  storage,  capacity  is  a  measure  of  the 
amount  of  equipment  required  and  the  search  time  is  a  measure  of 
the  amount  of  time  the  equipment  is  in  use. 

^.g.  IBM  7030,  7080,  705,  1401,  1620;  RCA  501 9  301  ;IInivac  1107. 
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The  nature  of  the  user's  problem  is  also  an  important  and 
obvious  factor  in  the  determination  of  relative,  efficiency.   If, 
for  example,  additions  to  or  deletions  from  the  file  are  rare,  it 
is  meaningless  to  use  these  times  as  a  basis  of  comparison.   If, 
however,  the  number  of  additions  to  the.  file  is  large 3   one  might 
consider  a  composite  allocation  of  a  main  ordered  file  and  a  sub- 
sidiary random  file,  which  is  searched  if  the  item  is  not  located 
in  the  main  file;  by  merging  the.  new  items  all  at  once  into  the  main 
file  when ,•  say,  the  random  file  becomes  large,  an  efficient  system 
may  be  developed. 

With  these  words  of  caution  in  mind,  let  us  now  summarize  the 
results  of  the  previous  parts.   Tables  2  and  3  will  be  of  assistance, 

The  random  allocation  has  no  desirable  qualities  other  than 
the  case  of  adding  items  and,  hence,  finds  little  application 
except  as  a  temporary  auxiliary  file,  as  noted  above. 

The  ordered  allocation,  which  is  probably  the  most  widely  used 
allocation  system,  is  quite  efficient  with  respect  to  both  storage 
and  search  time,  for  files  with  infrequent  changes.   If  the  rate  of 
change  is  not  low,  the  file  may  be  padded  with  dummy  items,  so  that 
the  deletion  of  an  item  is  accomplished  by  merely  designating  it  as 
a  dummy  and  the.  addition  of  an  item  will  involve  moving  only  a  few 
items.   If  a  dictionary-type  search  is  used  instead  of  the  binary 
search,  slight  rearrangements  of  the  proper  order  within  the  fine 
scan  range  may  be  tolerable,  thereby  permitting  even  few  item 
movements  when  adding  an  item„ 
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A  computable  allocation  is  also  quite  efficient  in  both  storage 
and  search  time,  but  the  file  items  must  have  very  special  properties 
to  allow  the  construction  of  a  reasonable  function.   Therefore, 
when  a  computable  allocation  is  possible,  it  probably  is  not  recog- 
nized as  such,  but  rather,  the  file  is  treated  as  a  mathematical 
entity  (e.g.  a  matrix  or  vector).   A  computable  system  is  frequently 
used  within  another  system,  e.g.  as  the  coarsest  can  in  a  dictionary 
search.   Often  the  function  evaluation  only  involves  the  use  of  index 
registers  and  indirect  addressing. 

Chained  structures  appear  rather  wasteful  with  respect  to  both 
search  time  and  number  of  bits  required.   However,  the  main  use  of 
such  structures  has  been,  not  for  searching  for  matching  items,  but 
for  linking  complex  data  or  program  chains  which  are  frequently 
changed.   Chained  structures  have  found  wide  use  in  symbol -manipula- 
ting, game -playing,  and  theorem-proving  programs. 
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6„    Representation  by  abstract  trees. 

The  abstraction  of  a  storage  scheme  which  is  a  combination  of 
two  or  more  of  the  basic  schemes  discussed  above  is  called  a  Tree . 
Organization  of  information,  or  keys  to  items  of  information ,  in  the 
form  of  a  tree  allows  utilization  of  certain  advantages  of  the  com- 
bined schemes  with  a  minimum  trade-off  in  terms  of  combination 
generated  disadvantages.   The  method  of  combination  and  the  variance 
of  tree  parameters  allow  the  programmer  an  opportunity  to  maximize 
the  desireable  features  of  a  given  combination,,   This  further  allows 
scientific  selection  and  provides  a  means  to  predict  performance. 

It  is  convenient  to  collect  definitions  used  in  the  following 
discussion  of  tree  structures.   The  terminology  varies  slightly  with 
various  authors  but  we  will  use  the  terms  adopted  by  Iverson  and 
Sussenguth0   T^J   L37J   Several  terms  are  illustrated  in  Figure  5. 

A  graph  comprises  a  set  of  nodes  and  a  set  of  associations 
specified  between  pairs  of  nodes.   If  node  i  is  associated  with  node, 
j,  the  association  is  called  a  branch  from  the  initial  node  i  to  the 
terminal  node  j .   A  path  is  a  sequence  of  branches  such  that  the 
terminal  node  of  each  branch  coincides  with  the  initial  node  of  the 
succeeding  branch.   Node  j  is  reachable  from  node  i  if  there  is  a 
path  from  node  i  to  node  j.   The  number  of  branches  in  a  path  is  the 


It  may  bother  some  readers  that  the  pictorial  representation  of 
trees  do  not  always  branch  upward  as  do  natural  trees.   Such  is  not 
the  case,  however,  and  the  literature  abounds  with  trees  "growing" 
every  which  way.   Branching  downward  seems  the  most  comfortable 
possibly  as  the  result  of  the  general  familiarity  with  the  organiza- 
tion chart  heirarchical  trees  (with  the  boss  on  top) 0 
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length  of  the  path.   A  circuit  is  a  path  in  which  the  initial  node 
coincides  with  the  terminal  node. 

A  tree  is  a  graph  which  contains  no  circuits  and  has  at  most 
one  branch  entering  each  node.   A  root  of  a  tree  is  a  node  which  has 
no  branches  entering  it,  and  a  leaf  is  a  node  which  has  nn  branches 
leaving  it„   A  root  is  said  to  lie  on  the  first  level  of  the  tree s 
and  a  node  which  lies  at  the  end  of  a  path  of  length  j-1  from  a  root 
is  on  the  j th  level.   A  tree  which  contains  n  roots  is  said  to  be 
n-tuply  rooted,  and  if  n  =  1  it  is  called  a  rooted  tree.   The  set  of 
nodes  which  lie  at  the  end  of  a  path  of  length  one  frcm  node  x 
comprises  the  filial  or  sib  set  of  node  x,  and  x  is  the  parent  node 
of  that  set.   The  nodes  within  a  sib  set  are  sometimes  referred  to 
as  sisters.   The  set  of  nodes  reachable  from  node  x  is  said  to  be 
governed  by  x  and  comprises  the  nodes  of  the  subtree  rooted  at  x„   A 
chain  is  a  tree  which  has  at  most  one  branch  leaving  each  node.   The 
number  of  branches  leaving  a  node  is  called  its  branching  ratio  or 
degree.   In  particular,  the  degree  of  each  leaf  is  zero, 

A  tree  is  said  to  be  uniform  if  all  nodes  on  level  j  are  utilized 
(or  "filled")  before  the  tree  is  expanded  to  include  nodes  on  level 
j  +  1.   Each  level  of  the  tree  corresponds  to  one  of  the  character 
sets  S.  of  a  file»   Such  a  tree  is  illustrated  in  Figure  6„   Only  the 
leaves  will  be  of  particular  interest  since  they  correspond  to  the 
items  of  a  file.   The  remaining  nodes  are  dummy  nodes  inserted  to 
fill  out  the  tree  for  search  purposes „ 
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That  set  of  nodes  on  level  j  which  are  the  leaves  of  the  subtree 
subtended  by  a  node  s_  on  level  j -k  is  called  the  kth  sib  set  of  node  s_ 
and  s_  is  called  the  kth  parent  of  the  sib  set.   In  general .,  the 
interest  will  be  in  first  sib  sets,  which  are  those  nodes  "directly 
reachable"  from  a  given  node  (the  parent  node) ,  and,  for  brevity, 
first  sib  sets  will  be  called  just  sib  sets  when  no  ambiguity  will 
result.   The  members  of  a  first  sib  set  are  called  s lb s „   In  a  uni- 
form d-way  tree,  (first)  sib  sets  have  exactly  d  members.   The  defini- 
tion of  sib  set  is  extended  so  that  the  set.  of  roots  constitutes  a 
sib  set  but  the  alternate  term  filial  set  is  probably  better  in  this 
case.   In  any  listing  of  the  members  of  a  sib  set,  some  node  must  be 
mentioned  first.     That  node  will  be  called  the  he  i r  node  of  the 
sib  set.   Sib  sets  are  important  because  tree  allocations  are  fre- 
quently characterized  by  the  manner  in  which  the  sib  sets  are  allocated 

In  order  to  show  the  necessary  memory  space  for  storage,  consider 
a  tree  of  height  h  (the  number  of  character  sets)  and  n  leaves  (the 
number  of  items  in  the  file)„   The  common  degree  d  (the  average 
number  of  characters  in  each  set)  is  such  that  d  =  n„   In  such  a 
tree,  the  kth  level  has  d  nodes.   Hence,  the  total  number  of  nodes 
in  a  tree  of  height  h  is  (See  Appendix);  | 


No  additional  ordering  relation  on  the  character  sets  is 
assumed  or  implied. 
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If  one  word  is  used  to  represent  one  node  and  if  n  is  assumed  larga, 

r (words)  =    d 
d  -  1 

for  the  tree  allocation  system.   However,  there  is  no  general  one-to- 
one  correspondence  between  nodes  and  words  in  computers  with  fixed 
word  length o 

The  nodes  of  a  tree  may  be  grouped  in  many  ways.   One  of  the. 
simpler  ways  is  to  chain  each  node  to  its  heir  (if  it  has  one). 
The  sib  sets  of  which  the  heirs  are  members  may  then  be  stored  in 
consecutive  memory  locations  (blocks)  in  a  random,  ordered,  or  com- 
putable manner  (Fig.  7).   In  addition,  the  sib  sets  themselves  may 
also  be  chained  together,  so  that  the  tree  has  two  types  of  chains 
associated  with  it*   one  joining  heirs  and  one  joining  sib  sets 
(Fig.  8).   A  third  method  is  to  store,  heirs  in  memory  blocks  and  to 
chain  the  sib  sets  (Fig.  9).   One  other  useful  chain  for  file- 
processing  is  a  chain  with  similar  characters  of  the.  same  character 
set  (tree  level).   See  Figure  10„   The  three  types  of  tree  chains 
are  called  heir  chains, sib  chains  and  character  chains  respectively0 

The  character  chains  are  superfluous  for  tree  structures  and, 

1 
hence,  need  be  included  only  to  speed  up  various  searches.   However, 

the  concepts  of  heir  and  sib  sets  are  essential  to  the  tree  structure. 

If  heirs  are  grouped  into  consecutive  memory  locations  and  the  sib 

sets  chained,  the  resulting  structure  strongly  resembles  Iverson's 


1 
See  page  75  for  search--  with  the  mask  vector  £4  not  a  full 

vector  or  a  prefix  vector. 
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right  list.   [l6j    Indeed,  it  is  exactly  a  right  list  if  the  heir 
blocks  are  stored  in  the  proper  order,  i.e.,  if  the  chain  is  determined 
by  the  right  list  degree  vector„   Similarly,  if  sib  sets  are  grouped 
into  blocks  and  the  heirs  chained,  the  structure  resembles  a  left 
list. 

Each  leaf  of  the  tree  corresponds  to  an  item  of  the  file.   The 
intermediate  nodes  are  dummies  inserted  to  fill  out.  the  tree  and  form 
a  path  which  is  traced  from  root  to  leaf  via  the  heir  and  sib  links. 

The  nodes  on  level  i  are  identified  with  set  S, ,  and  within  each  sib 

1 

set  on  level  i  there  are  enough  nodes  to  correspond  to  the  characters 

of  set  S.  o   There  are  at  most  m  in  each  sib  set,  and  there  are  d  on 
1  s 

the  average.  The  memory  word  or  words  corresponding  to  a  given  node 
contains  the  representation  of  the  character  associated  with  the  node 
(plus  any  chaining  addresses  and  data  that  may  be  needed).  The  label 
or  name  of  a  node  is  the  concatenation  of  the  characters  of  the  path 
leading  to  that  node  from  a  root.  Thus  the  label  of  a  leaf  is  just 
an  item  (In  Fig.  5  the  character  associated  with  a  node  is  underlined 
and  the  remaining  part  of  its  label  is  not). 


It  should  be  noted  at  this  point  that  the  labels  of  nodes  are 
not  always  simple  characters  or  concatenations  of  characters 
(alphabetic  or  numeric).   In  some  applications  of  tree  structures, 
the  labels  are  entire  words  or  strings  of  words „   The  problem  of  word 
length  is  more  acute  in  these  cases  but  the  principles  of  manipula- 
tion are  the  same.   Therefore  the  tree  discussion  will,  for  clarity s 
be  in  terms  of  simple  characters. 
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A.   Methods  for  computer  storage. 

For  the  purpose  of  computer  storage  of  a  tree  structure,  it 
is  useful  to  consider  the  label  or  name  of  each  node  and  its  associated 
location-relation  parameters  as  a  vector.   This  vector  is  then  con- 
sidered as  the  jth  row  of  a  chain  list  matrix.   See  Figure  11.   The 
nodes,  and  thus  the  rows  of  the  matrix,  may  be  listed  in  three 
different  orders:   1)  arbitrary  (random)  order,  for  chained  heirs 
and  chained  sib  set  type  trees,  2)  level  order,  for  chained  heirs  and 
block  sib  set  type  trees,  and  3)  subtree  order  for  block  heirs  and 
chained  sib  set  type  trees.   See  Figure  11 0   The  matrix  representa- 
tions of  tree  structures  in  Figure  11  include  additional  relationships 
which  serve  to  simplify  programming.   Indeed,  when  the  nodes  are 
listed  in  either  level  or  subtree  order,  the  second  and  the  third 
columns  (names  and  degrees)  of  the  matrices  L  and  S  may  be  shown,  to 
be  sufficient,   [32j 

The  big  advantage,  of  course,  to  the.  arbitrary  or  random  ordered 
matrix  is  that  it  is  easy  to  add  or  delete  nodes  and  effect  other 
structural  changes  as  required  in  many  dynamic  systems.   No  re- 
shuffling is  needed  at  any  time.   If  the  nodes  are  listed  in  level 
or  subtree  order,  much  of  the  structure  is  inherent  in  the  order  of 
listing  and  less  memory  is  wasted  on  structure  information.  However, 
a  severe  price  is  sometimes  paid  when  information  too  lengthy  for 
the  available  extra  space  is  inserted  thereby  requiring  relocation 
of  major  portions  of  the  matrix. 
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Chained  hoira  and 
chained  eib  acta. 

Ai   index 


Chain  liet  matrix  L 
in  level  order. 
Chained  he ire  and 
block  alb  eete.* 


L*  index 


A2  name  (or  label)     L2  name  (or  label) 

A.  root  on  previoua    L,  degree 
*     i~.—  i  t a j_\  y 


level  (parent  node) 

A^  heir  chain 

Ac  eib  eet  chain 
(viator  nodus) 


L^  heir  chain 


Chain  liet  matrix  S 
in  eubtree  order. 
Block  heira  and 
chained  eib  seta,* 

&l     index 

3 2  neme   (or  label) 

S*  degree 


eib  eet  chain 
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Ono  a.pace  i«s  left  vncnnt  at  the  end  of  each  block.   In  gonerel 
the  numbor  of  of  vocancioo  ia  not  ono.  Ruthor  it  io  e  decision 
on  the  part  of  the  programmer  whether  to  leave  ej>r«ce  for 
expansion  or  not,  and  how  much.   The  decision  ie  booed  on  the 
oharocterlatlce  of  the  file  and  ita  intended  use,  e.g.  oizo, 
froquency  of  aearch  va.  frequency  of  addition  and  deletion,  etc. 
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The  factor  of  computer  word-length  enters  into  the  consideration 
at  this  point.   As  we  have  previously  pointed  out,  computers  that 
have  variable  word  lengths  free  the  programmer  from  any  concern  over 
the  row  vector  fit.   When  dealing  with  fixed  word  length  machines, 
inefficiencies  will  be  inevitable  when  the  row  vector  length  is  not 
an  even  multiple  of  the  computer  word  length. 

Memory  access  systems  for  processing  variable  length  operands 
in  fixed  word  length  computers  have  been  proposed  as  a  solution  to 
this  problem.   Henderson  and  Flynn  have  suggested  a  scheme  which 
addresses  the  characters  of  the  memory  words  vice  the  words  them- 
selves „   J_7_J   A  control  triple  contains  the  character  address,  the 
number  of  characters  or  length  of  the  operand,  and  a  single  bit  which 
signals  right  or  left  justification.   Masking  and  shifting  operations 
are  performed  in  the  hardware  and  entail  no  time,  penalty. 

As  an  example  of  the  necessary  word  length  computation,  consider 

again  the  row  vectors  of  the  matrices  in  Figure  11 „   The  indices  in 

A,  ,  L,  ,  and  S   are  the  addresses  of  the  computer  words  containing 
111 

each  row  vector.   If  we  allow  the  names  of  nodes  to  be  any  alpha- 

*  b      ^ 
numeric,  then  b  =  6  bits  and  m  —  2  =  &4  £  d.   We  pointed  out 

previously  (p. 60  )  that  in  some  applications  of  trees  the  names 

of  the  nodes  are  more  complex  than  a  single  alphanumeric  character. 

In  these  cases  the  character  may  be  replaced  by  the  address  of  a 

list  which  may  be  any  length  >  1.   See  Case  2  below; 
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Case  1:   Alphanumeric  name 

A       6 
A;    115 

AA      15 
A  _15 

51  bits 

Case  2 :   Complex  name 

A.    15 
A2    15 

a;   15 

A*    15 


6 

6 
15 
27  bits 


6 

6 

JL5 

2  7  bits 


15 
6 
15 
36  bits 


15 

6 

11 
36  bits 


60  bits 


Two  distinct  cases  occur  when  dealing  with  multi-tree  structures, 
The  first  is  the  case  in  which  all  the  nodes  are  considered  distinct. 
The  second  is  the  case  in  which  many  nodes  exist  in  more  than  one 
rooted  subtree.   Note  that  the  nodes  within  each  rooted  subtree  are 
considered  distinct.   Due  to  the  greater  complexity  of  the  minimum 
matrix  representation  for  the  second  case,  its  use  should  be  con- 
sidered only  when  the  number  of  nodes  which  are  unique  to  one  given 
tree  is  less  than  one  half  the  total  number  of  nodes.   L^j 

A  heirarchical  tree  is  an  example  for  which  it  is  expected  that 
each  node  will  appear  only  once,  while  syntactical  trees  representing 
sentence  structures  is  an  example,  for  which  a  large  number  of  nodes 
(words)  will  be  repeated. 

The  problem  of  non-distinct  or  overlapping  nodes  is  basic  in 
the  application  of  list  and  tree  structures.   One  of  the  important 
merits  of  list  processors  is  that  data  having  multiple  occurrences 
often  need  not  be  stored  more  than  one  place  in  the  computer.   One 
may  visualize  this  situation  as  the  overlapping  or  intersection  of 
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lists,  and  its  utilization  may  be  regarded  as  a  step  toward  global 
optimization  of  storage  allocation. 

However,  overlapping  lists  pose  a  problem  when  it  becomes 
necessary  to  erase  a  particular  list  or  node  and  return  the  space 
to  the  list  of  available  space  (LAS).   When  a  given  list  is  no  longer 
needed,  it  is  desired  to  erase  just  those  parts  that  do  not  overlap 
other  lists  still  in  use.  There  is  no  general  method  of  doing  this 
short  of  making  a  survey  of  all  lists  in  memory.   F 3 j 

Several  solutions  to  this  problem  have  been  found,  each  with 
its  own  disadvantages  or  limitations.   A  solution  incorporated  by 
McCarthy  in  his  LISP  merely  abandons  lists  as  they  are  no  longer 
needed.  j"22J   Then,  when  the  LAS  has  been  depleted,  a  survey  is 
made  of  all  registers  currently  employed  in  non-abandoned  lists  and 
the  complement  of  this  set  is  returned  to  the  LAS. 

There  are  two  sources  of  inefficiency  in  this  method.   Depend- 
ing on  the  computer,  the  application,  the  programmer's  skill,  etc., 
the  time  needed  to  reclaim  unused  storage  space  is  nearly  independent 
of  the  amount  of  space  reclaimed.   The  efficiency,  therefore,  drops 
off  rapidly  as  the  memory  approaches  capacity.   Second,  the.  method 
as  used  by  McCarthy  required  that  a  bit  (in  this  case,  the  sign  bit) 
be  reserved  in  each  word  for  tagging  accessible  registers  during  the 
survey  of  accessibility.   If  data  consists  of  signed  integers  or 
floating-point  numbers,  this  results  in  awkwardness  and  further 
loss  of  efficiency.  This  method  can  be  modified  by  setting  aside  a 
block  of  memory  and  establishing  a  one-to-one  correspondence  between 
the  bits  in  this  block  and  the  words  in  the.  remaining  memory.   This 
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modification  eliminates  the  awkwardness  of  arithmetic,  but  takes 
a  further  toll  in  the  speed  of  reclamation.  \_3^ 

A  second  method,  used  by  Gelernter  et  al  in  FLPL,  consists  of 
adopting  the  convention  that  each  list  component  is  "owned"  by 
exactly  one  list,  and  "borrowed"  by  all  other  lists  in  which  it 
appears.   [_10j   A  "priority  bit"  is  then  used  in  each  "location 
word"  to  indicate  whether  the  list  entry  referenced  is  owned  or 
borrowed.   An  erasing  routine  is  used  which  erases  only  those  parts 
of  a  list  that  are  owned.   This  system  obviously  works  only  so  long 
as  no  list  is  erased  that  owns  a  part,  that  has  been  borrowed  by 
some  other  list  not  yet  erased. 

A  third  method  has  been  developed  by  Collins  as  a  means  to 
overcoming  the  disadvantages  of  the  first  two.  \_  3 J   The  method  is 
based  on  the  interspersion  of  words  containing  reference  counts. 
Viewed  in  terms  of  the  conventional  diagrams  of  lists,  such  a 
reference  count  is  a  tally  of  the  number  of  arrows  point  _to  the 
box  containing  the  reference  count  and  emanating  from  boxes  in  the 
same  or  other  lists.   See  pp.  5 • ,  6  .   A  two-bit  type  code  then 
appears  in  each  location  word  ,  one  value  of  which  serves  to  identify 


Each  list  is  represented  by  a  set  of  location  words  and  data 
words.   Data  words  contain  atoms  and  consist  of  a  single  field. 
Location  words  are  divided  into  fields  which  contain  a  type  code, 
the  address  of  the  next  location  word,  and  a  field  which  may  contain 
any  one  of  four  things,  determined  as  follows  by  the  type  codes 


0  Location  of  an  atom 

1  Atom  (if  max.  no.  of  bits  small  enough) 

2  Location  of  a  list 

3  Reference  count 
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a  reference  count  as  such.   The  obvious  convention  is  adopted  that, 
unless  a  word  contains  a  reference  count,  there  is  exactly  one 
arrow  pointing  to  it.   A  reference  count  of  one  is  thereby  redundant 
but  otherwise  harmless. 

A  disadvantage  of  this  system  is  that  atoms  cannot  be  borrowed 0 
Often  this  will  not  be  a  significant  encumbrance.   When  it  is,  it 
can  be  partially  solved  by  uniformly  replacing  each  atom,  a,  by  a 
list  whose  only  term  is  a. 

Finally,  a  method  slightly  different  from  the  solutions  already 
given  is  the  use  of  an  auxiliary  aid  to  keep  tabs  on  distinct  nodes 
and  successor  pairs.  This  aid  is  called  a  connection  matrix.   A  clear 
description  of  this  matrix  is  too  lengthy  to  be  given  here.   Suffice 
to  say  that  it  exists  apart  from  the  chain  list  matrix  which  describes 
the  tree  but  the  chain  list  matrix  in  this  case  contains  only  distinct 
nodes.   See  j~ 32 J  for  a  brief  description  and   £.19]    f°r  a  more 
rigorous  treatment. 
B.    Search  operations  for  trees. 

We  previously  discussed  search  operations  when  data  was  stored 
according  to  any  one  of  the  four  basic  schemes,  i.e.  random,  ordered, 
computable  and  chained  allocations „   The  same  searches  may  be  applied 
to  the  tree  allocations  with  minor  modifications  since  the  tree 
allocations  employ  various  combinations  of  the  basic  allocations. 
We  shall  again  consider; 
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Match  search 

Multiple -match  search 

Smallest  search 

Next  smaller  search 

Add  an  item 

Delete  an  item 
It  will  be  assumed  that  the  tree  under  consideration  is  uniform  for 
the  derivation  of  efficiencies  but  this  restriction  does  not  apply 
to  the  search  procedures  themselves.   As  in  Section  4,  the.  treatment 
here  is  adapted  from  Sussenguth  with  minor  modifications,,   ^36J 

To  conduct  a  match   search  in  a  tree  allocation,  one  first 
searches  the  sib  set  of  roots  to  locate  the  node  whose  associated 
character  is  the  same  as  the  first  character  of  the  argument.   Then 
one  proceeds  to  the  first  sib  set  of  this  node  and  searches  again 
to  locate  the  node  with  character  matching  the  second  character  of 
the  argument.   This  process  continues  through  level  h.   The  manner 
in  which  the  sib  sets  are  searched  and  the  manner  in  which  the  next 
sib  set  is  located  depend  upon  the  particular  allocation. 

Consider  first  the  allocation  in  which  heirs  are  chained  and 
the  sib  sets  are  computable  allocations  within  blocks.   For  a 
match  search,  h  words  must  be  retrieved  from  memory,  one  for  each 
level.   The  address  of  the  root  word  is  computed  for  the  first 
element  of  the  argument  and  an  initial  address  parameter.   The  root 
word  will  contain  the  parameter  required  to  locate  its  sib  set,  and 
this  parameter  and  the  second  element  of  the  argument  are  used  to 
compute  the  address  of  the  word  corresponding  to  the  proper  node 
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on  the  second  tree  level.   This  process  is  repeated  for  all  J>  levels 
(program  Tl),  thereby  entailing  h  memory  accesses  and  h  function 
evaluations. 

As  a  simple  example  of  such  an  allocation  consider  the  character 
sets  to  be  the  integers  from  0  to  m-1,  and  let  the  sib  sets  be  stored 
in  order.   At  each  node  is  stored  the  address  c<  of  its  hei.r  node. 
Thus,  if  the  node  of  the  kth  level  has  just  been  retrieved ,  the 
address  of  the  node  of  the  (k+l)th  level  is  just  o<    +   j?     where 
x  is  the  argument  of  the  search. 

An  allocation  similar  to  this  is  that  in  which  the  heirs  are 
chained  but  the  sib  sets  are  random  allocations  within  blocks.   The 
chaining  from  node  to  heir  is  accomplished  by  extracting  the  chain 
address  which  is  stored  at  each  node.   The  proper  node  of  the.  sib 
set  is  obtained  by  comparing  the  stored  character  with  the.  appro- 
priate character  of  the  argument;  if  they  match,  the  proper  node 
of  the  sib  set  has  been  found;  if  they  do  not  match ,  the  next  memory 
word,  which  is  the  next  node  in  the  sib  set,  is  tested.   When  the 
proper  node  has  been  found,  its  heir  is  located.   This  is  repeated 
for  all  h  levels  (program  T2)»   Thus,  the  search  time  is  composed 
of  the  h  item  times  needed  to  pass  from  root  to  leaf  via  the  heir 
chain  plus  the  item  times  needed  in  the  sib  set  searches.   These 
are  at  most  d-1  per  set,  at  least  none  per  set,  and  (d-l)/2  per 
set  on  the  average. 

The  search  procedure  for  the  allocation  in  which  both  the  heirs 
and  the  sib  sets  are  chained  is  exactly  the  same  as  the.  search  just 
described  except  that,  when  searching  the  sib  sets,  the  sib  chaining 
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address  is  used  to  locate  the  next  node  rather  than  consecutive 
memory  locations.   (Notice  that  program  T3  is  program  T2  but  for 
one  line)0 

Because  these  three  allocations  are  so  similar,  the  details  of 
the  remaining  searches  will  be  discussed  only  for  doubly -chained 
trees. 

The  items  satisfying  the  multiple -match  search  when  u  is  a 
prefix  vector  are  those  items  corresponding  to  the  subtree  subtended 
by  the  node  with  label  u/x.   Program  T4  delineates  such  a  search: 
First,  the  node  u/x  is  located  by  a  match  search.   The  leaves  of 
its  subtree  are  found  by  finding  the  heir  node  of  u/x,  then  the  heir 
of  the  heir,  etc.,  until  the  heir  which  is  a  leaf  is  found;  its  sibs 
are  retrieved.   Then,  it  is  necessary  to  "back-up"  one  level  to  the 
parent  node  of  the  sib  set  just  retrieved,  find  the  next  node  of 
the  sib  set  of  that  level,  move  "forward"  to  its  heir,  and  retrieve, 
its  associated  sib  set.   This  process  is  repeated  until  the  entire 
subtree  has  been  retrieved.   Thus,  if  the  u/x  node  is  in  level  h-k 

(i.e0  u  has  h-k  l's)  the  d  subtended  leaves  are  retrieved  by 

k-1 
locating  d     sib  sets;  the  parents  of  these  sib  setss  in  turn, 

k-2 
form  d    sib  sets  themselves,  and  so  on  back  to  the  u/x  node. 

Thus  to  retrieve  the  s  =  d   subtended  leaves, 

XA  o*    /*+A*JI     MM/) 

items  must  be  handled.   (The  factor  of  2  occurs  because  each  non- 
leaf  node  is  retrieved  twice,  one  in  the  forward  direction  and  cnce 
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in  the  backup.)   The  match  search  time  to  retrieve  the  u/x  node  must 
be  added  to  this  to  determine  the  multiple -match  search  tiime. 

To  accomplish  this  search,  it  is  necessary  to  locate  the  parent 
of  a  given  node;  the  chaining  structure  introduced  thus  far  does  not 
permit  this.   However,  a  simple  modification  can  provide  this  facility 
The  last  node  in  each  sib  set  has  nothing  chained  to  It  in  the  sib 
chain;  therefore,  its  address  portion  is  vacant  and  may  be  used  to 
indicate  the  location  of  the  parent  of  the  sib  set.   To  avoid  an 
ambiguity   of  having  a  "backward -heir"  chain  address  in  the  portion 
of  the  word  assigned  to  sib  chains,  it  is  necessary  to  add  an  extra 
marker  bit  to  the  sib  chain  addresses  which  indicated  to  which  chain 
the  address  belongs.   These  "backward -heir"  or  parent  chains  are 
indicated  by  two  broken  lines  in  Fig.  8    and  the  marker  bit  is  denoted 
by  the  symbol  A  in  the  programs. 

The  tree  allocations  with  no  sib  chains  cannot  have  the  back-up 
property  as  described.   Frequently,  however,  this  may  be  circumvented 
by  additional  bookkeeping  in  the  search  programs,  which  may  entail 
a  considerable  expenditure  of  time  with  respect  to  the  retrieval  of 
a  single  address.   If  there  is  sufficient  word  length  available,  an 
additional  link  may  be  included  to  indicate  the  parent  node  as 
illustrated  in  Figure  11,  Column  A,. 


There  is  no  ambiguity  if  the  number  of  nodes  in  the  sib  set  is 
known.   This  will  not,  however,  be  the  case  in  general. 
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The  smallest  search  in  the  doubly-chained  tree  allocation 
proceeds  exactly  as  the  match  search,  except  that  the  smallest;  charac- 
ter in  each  sib  set  is  selected  rather  than  a  matching  character, 
(program  T5) „ 

To  search  for  the  item  next  smaller  than  a  given  item  is  more 
complex,  however^  One  first  locates  the  given  item;  then  its  sibs 
are  searched  to  find  a  smaller  item.   (Specifically,  since  the  words 
contain  only  a  single  character,  the  sib  set  is  searched  for  the 
character  next  less  than  the  least  significant  character  of  the  argu- 
ment).  If  such  an  item  exists,  the  search  is  completed.   If  not,  that 
is  if  the  given  item  is  the  smallest  in  the  set,  the  parent  node  must 
be  retrieved  and  its  sibs  searched  for  a  smaller  character.   If  one 
is  found,  its  first  sib  set  is  searched  for  its  largest  item  and 
the  search  ends.   However,  it  may  still  be  necessary  to  back-up 
another  level.    Thus,  if  k  back-ups  are  required,  the  search  is 
composed  of  one  match  search, 2k  links  in  heir  chains  (k  in  each 
direction),  and  next  smaller  or  largest  searches  through  2k  +  1  sib 
sets  (the  original  sib  set  of  leaves,  k  searched  for  the  next  smaller 
item,  and  k  for  the  largest), 

Since  about  d  item  times  are  required  to  search  a  sib  set, 
the  search  times  may  be  calculated  once  the  appropriate  values  for 
k  are  known.   Clearly,  the  minimum  k  is  0,  and  the  maximum  h  -  1 
(the  argument  is  the  smallest  item  of  the  file).   To  ascertain  the 


In  the  chained  allocation  at  most  one  back-up  was  needed  because 
the  entire  item  could  be  examined  and  the  most  significant  character 
changed.   Here  several  back-ups  may  be  needed  because  only  one  character 
may  be  examined  at  a  time. 
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average  k,  consider  the  character  sets  to  be  the  Lntegers  between 

0  and  d-1.   In  the  file  there  are  d    -  d      items  with  exactly 

k  trailing  O's,  and  for  each  such  item  k  back-ups  are  necessary.   Thus 


'tTV 


avg   ^*/.<d"   -\*    *)i 


A-/   , 

=        1 


d   -   1* 


In  heir -chained  tree  allocations  the  procedures  for  adding 
or  deleting  items  are  analogous  to  the  procedures  for  computable, 
random,  or  chained  files  according  as  the  sib  sets  are  computable, 
random,  or  chained. 

The  tree  allocation  schemes  as  described  thus  far  allow  efficient 
searches  only  when  the  mask  vector  u  is  either  the  full  vector  or 
a  prefix  vector.   For  example,  to  perform  a  match  when  u  s  (0,1,...), 
it  is  necessary  to  search  all  of  the  nodes  of  the  second  level  to 
locate  characters  matching  the  second  character  of  the  argument. 
To  test  all  the  second  level  nodes,  it  is  necessary  to  proceed  from 
a  root  to  its  sib  set,  back  to  the  root,  down  to  the  next  root,  to 
its  sib  set,  etc.   This  could  be  a  time -cons  timing  process,  especially 


See  appendix. 
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if  u  had  several  leading  O's. 

If,  however,  the  allocation  includes  character  chains  (Figure  10), 

arbitrary  u  vectors  are  permissible „   Consider  the  multiple -match 

search  x  wrt  u  for  a  tree  which  has  heir,  sib,  and  character  chains. 

Assume  u  has  l's  in  positions  u.-,,  u„,_  ,...,  u.,   and  Q"s  elsewhere. 
-  r        -jl'  -j 2  '    '  -jk 

The  character  chain  of  x.,   is  traced  to  its  first  nocfp  .   Then  the 
(j^.  "  j^  i)th  parent  of  that  node  is  found  by  proceeding  via  the 
parent  chain  for  j,  -  j   1  levels.   If  that  parent  node  does  not  have 
character  value  x   1 ,  this  portion  of  the  search  may  be  terminated, 
and  the  next  node  of  the  character  chain  tested.   If  the.  parent  node 
does  have  value  x.,  ,,  however,  find  its  parent  in  the  j,   th  level 
and  test  it  similarly.   If  the  original  candidate  node  is  such  that 
all  its  parents  in  levels  j- ,  :Lj,...,  j,  have,  character  values 
matching  x,  then  the  leaves  of  the.  subtree,  subtended  by  that  node 
constitute  part  of  the  set  of  items  satisfying  the  search  (program  T6). 
C.   Evaluation  of  results. 

Tree  allocation  systems  appear  both  efficient  and  versatile. 
Most  files  can  be  allocated  in  a  tree  structure  without  difficulty. 
One  set,  say  S1 ,  is  chosen  to  correspond  to  the  roots,  and  all  items 
having  the  same  first  character  are  placed  in  the  subtree  subtended 
by  the  root  corresponding  to  that  character.   Sib  sets  in  the  second 
tree  level  are  formed  from  the  characters  of  another  set,  say  £_, 
and  then  their  sib  sets,  through  h  levels.   If  the  file  items  are 
correlated  (i.e0  have  common  characters),  and  it  is  reasonable  to 
expect  that  such  correlation  would  exist  in  practical  applications, 
the  number  of  dummy  nodes  which  must  be  added  is  relatively 
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small.      Indeed,  for  many  trees  fewer  bits  are  required  to  store 
the  file  and  its  associated  structure  than  are  required  to  store  the 
file  on  an  item-by-item  basis.   This  apparent  paradox  vanishes  when 
it  is  observed  that  one  character  at  the.  root  of  the  tree  is  shared 
by  all  the  items  of  the  subtree  subtended  by  that  root.   Moreover , 
in  a  tree  allocation  it  is  possible  to  perform  all  of  the  operations 
considered  with  relative  ease.   More  complex  operations  (e.g.  search 
for  the  item  next  smaller  than  a  given  item  with  respect  tA   an  arbi- 
trary permutation)  can  be  accomplished  with  more  elaborate  programs. 

When  comparing  the  relative  efficiencies  of  the  tree  allocations 
the  same  considerations  regarding  like  quantities  as  were  discussed 
before  apply  (p. 43)   ,  i.  e.  time  and  required  storage  capacity  with 
due  regard  for  their  inverse  relationship.   Table  4  is  a  summary  of 
search  times  and  storage  ratios.   As  in  Table  2,  there  are  three 
entries  for  each  search g   the  maximum,  minimum.,  and  expected  number 
of  item  times  to  complete  the  search.   Frequently  the  entries  are 
approximations  to  the  expressions  which  have  been  derived.   When 
applicable,  the  u  and  p_  for  which  the  entries  are  calculated  are 
shown.   Table  5  repeats  the  average  values  of  Table  4  but.  with  typical 
values  chosen  for  the  file  parameters. 


It  is  easy  to  show  this  number,  which  is  n(l-r(words)) 9  is 
minimized  if  the  set  of  smallest  order  is  chosen  to  correspond  to 
the  roots,  the  set  of  next  smallest  order  to  the.  second  level,  etc 
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Tree 

Tree 

Tree 

computable 

sibs 

random 
sibs 

chained 
sibs 

Match 

18 

19 

19 

Multiple 
match 

54 

-166 

166 

Smallest 

18 

31 

31 

Next 

\ 

smaller 

.,   18 

>27 

27 

Add  item 

large 

21 

20 

Delete  item 

large 

21 

20 

Bits  per 
word 

15 

21 

37- 

Storage 
ratio  (words) 

1.24 

1.24 

1.24 

Storage 
ratio  (bits) 

•52 

.72 

1.27 

Parameters 

all 

a  « 15     h  r  6 

characters 

b  *  6    m  a  30 

implicit 

0  a  3    n  =20000 

f:2     8»100 

Comparative  item  times  for  the  typical 
parameters  shown. 


Table  5 


7.   Minimization  of  expected  search  time,  and  the  search-time, 
storage-space  product. 

We  have  seen  the  general  reduction  in  item  times  with  only  a 
slight  increase  in  storage  ratios  when  considering  tree  Allocations 
vs.  the  four  basic  allocations.   Note  Tables  2  and  3  vs.  Tables  4 
and  5,   The  item  times  given  in  Table  5  are  dependent  on  the,   selection 
of  the  typical  parameters  shown  just  beneath  the  table.   It  is  the 
purpose  of  the  following  discussion  to  show  some  calculations  based 
on  these  parameters  which  will  optimize  the  desired  characteristics 
of  the  tree  structure. 

The  parameters  shown  in  Table  5  are  defined  in  Table  i  and  are 
reiterated  here  for  convenience: 

a  ■  15    number  of  bits  to  specify  an  address 
b  =   6    number  of  bits  to  specify  a  character 
c  =  3    number  of  character  positions  which  are  chained 
f  =   2    number  of  item  times  to  compute  function  f 
h  =  6    number  of  character  sets  (height  of  the  tree) 
m  =  30     number  of  members  of  each  character  set 
n  =  20,000    number  of  items  in  the  file 

s  =  100    number  of  items  satisfying  a  multiple -match  search 
Parameters  a  and  b  are  characteristics  due  to  binary  coding 
and  the  machine  in  use.   c  and  f  are  included  to  allow  calculations 
for  the  "Tree  with  computable  sibs"  column.   We  disregard  these  new 
since  the  difficulty  with  add  item  and  delete  item  functions  make 
this  scheme  unfeasible  except  in  very  limited  applications,   n  a-M 
s  are  characteristics  of  the  file  used  for  comparison  and  are 
therefore  beyond  our  control.   This  leaves  h  and  ms  the  only  two 
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parameters  available  for  manipulation. 

If  there  are  m  nodes  in  a  filial  set,  the.  expected  number 
which  must  be  tested  before  a  match  between  a  node  value  and  the 
appropriate  element  of  the  query  is  %m.  Thus  the  expected  number 
of  chaining  links  required  to  find  a  match  within  one  filial  set 
is  %  (mfl);  one  link  to  reach  the  filial  set  and  %  (m  -  1)  to 
search  it. 

Assuming  the  search  time  is  proportional  to  the  number  ©f 

chaining  links  traversed,  the  expected  search  time  may  be  calculated 

if  all  the  filial  set  sizes  are  known.   However s  it  is  inconvenient 

to  use  the  set  of  all  filial  set  sizes  in  macroscopic  calculation. 

For  computational  purposes  an  average  filial  set  size  for  each 

tree  level  is  defined  as: 

number  of  nodes  on  level  i  ,,,. 

mi   = (1) 

number  of  filial  sets  on  level  i 

Since  the  average  time  to  search  a  filial  set  on  the  i th  level  is 

%(m.  +1),  the  expected  time  to  search  a  file  of  n  items  allocated 

as  an  h  level  tree  is 

t  -  %  S_  (mi  +  X)'  (2) 

Z-l 

Equation  (2)  gives  the  expected  search  time  in  terms  of  the  para- 
ge 

A 


meters  m  and  h.  These  parameters  are  related  by  the  expression 


IT 


m.  =  ns 


because  the  product  of  the  filial  set  sizes  is  the  number  of  leaves. 


C37_ 
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The  binary  coding  of  alphanumeric  characters  allows  additional 
flexibility  in  the  variation  of  m.  provided  the  key  elements  may 
be  manipulated  freely.  The  key  may  then  be  considered  to  consist 
of  a  single  string  of  binary  digits  rather  than  several  disjoint 
elements,  each  consisting  of  several  binary  digits.   For  example, 
the  key  "CAT"  is  considered  as  the  binary  string  "010011010001110011" 
rather  than  the  set  of  distinct  elements  "010011",  "010001",  and 
"110011".   With  keys  of  this  format,  the  binary  digits  may  be 
grouped  to  give  the  most  efficient  search  system,  i.e.  m.  and  h 
may  be  selected  without  constraint  (other  than  (3))  to  minimize  t. 
An  example  of  this  division  technique  is  shown  for  several  alpha- 
betic characters  in  Figure  12.  [_37j 

Noting  the  form  of  equations  (2)  and  (3) ,  it  is  clear  that  the 
expected  search  time  is  independent  of  the  order  in  which  the  levels 
are  taken  and  that  the  minimum  search  time  is  achieved  when  the 
m  are  all  equal „   If  the  m,  are  all  equal,  (3)  reduces  to 
m  =  n  and  (2)  becomes 

t  =  h   h(m  +  1)  =  %(m  +  1)  log  n  (4) 

Taking  the  first  derivative  of  this  expression  with  respect 
to  m  and  setting  it  equal  to  zero,  we  find  a  minimum  t  at  m  =  3.6„ 
Equation  (4)  is  normalized  and  plotted  in  Figure  13.   By  simple, 
arithmetic  we  find 

t  =  2.3  log    n  -  1.24  log2  n 
That  is,  the  expected  search  time  in  the  optimum  case  is  only  24 
percent  slower  than  the  binary  search  time.   Moreover,  since  the 
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time  required  to  add  or  delete  an  item  to  or  from  the  file  is 
approximately  the  same  as  the  search  time,  a  considerable.  Improve- 
ment over  the  file  allocated  for  binary  searching  (ordered  alloca- 
tion) . 

Therefore  in  summarizing  the  case  when  it  is  possible  to  freely 
manipulate  keys,  the  key  elements  should  be  selected  so  that 

1)  all  paths  from  a  root  to  a  leaf  have  the  same  length,  h; 

2)  all  filial  sets  have  the  same  number  of  members,  m^ 

and  3)  the  common  filial  set  size  m  is  near  3.6  and  h=log  n„  |37  | 

m     i»   J 

The  above  rules  for  the  adjustment  of  parameters  are  based  on 
the  consideration  of  minimum  search  time  only,   A  more  realistic 
value  may  be  obtained  by  also  considering  the  cost  due  to  the  higher 
storage  ratio  caused  by  small  filial  set  size.   The  total  number  of 
(dummy)  nodes  in  the  tree  which  serve  only  to  index  the  leaves 
(items)  is  A        ; 

j'~l    JL*i  (5) 

If  it  is  assumed  that  all  filial  sets  m.  are  equal  (5)  becomes 

A 

W  m      \       nr1  =  m  EB Z-L)  CL        m    n  (6) 

Equation  (6)  shows  that  as  m  increases,    the  number  of  required 
storage  locations  decreases. 
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We  have  previously  defined  a  meaningful  measure  of  efficiency 
for  storage  allocations  as  the  product  of  search  time  and  storage 
capacity.   Using  (4)  and  (6) 

C  =  t  x  W  =  h   h(m  +  1)1  x   fm  feji  iM") 

=  h   hm  C  «  +  1   )  (mh  -  1  )  (7) 

K  ri  -  1  "v 

Equation  (7)  is  plotted  in  Figure  14  in  normalized  form.   The  minimum 
cost  is  achieved  when  m  =  5.3. 

It  is  useful  to  note  that  in  both  Figure  13  and  Figure  14,  the 
curves  are  shallow  at  the  minima.   The  values  of  m  may  therefore 
vary  from  approximately  three  to  eight  with  less  than  20  percent 
degradation. 

When  it  becomes  necessary  to  use  an  m  greater  than  these 
optimum  values,  the  efficiency  of  the  system  will  suffer.   Numeric 
key  elements  would  be  acceptable,  since  m  =  10.   But  when  the  keys 
are  alphabetic  (m  =  26)  or  alphanumeric  (m  ^  36),  it  becomes 
necessary  to  add  refinements. 

In  the  discussion  of  efficiencies  above,  a  tree  with  chained 
heirs  and  random  chained  sib  sets  was  implied.   We  have  seen  that 
this  sort  of  doubly-chained  tree  requires  two  address  fields 
(one  for  the  heir  link  and  one  for  the  sib  set  link)  plus  a  field 
for  the  label.   A  slightly  more  complex  allocation  in  which  the 
sib  sets  are  chained  in  an  ordered  arrangement  (this  will  permit  a 
binary  type  search  within  each  sib  set)  will  be  efficient  for 
larger  m.   To  accomplish  this  ordering  for  a  binary  search  within 
a  filial  set,  it  is  necessary  to  expand  the  memory  map  (matrix  row 
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vectors)  to  three  addresses  per  node,  corresponding  to  the  states; 
greater  than,  equal  to,  and  less  than.   In  the  two  unequal  condi- 
tions the  address  is  that  of  the  next  sister  node  to  be  examined; 
the  equal  condition  address  leads  to  the  filial  set  entry  at  the 
next  level,  or  if  the  termination  symbol  has  been  recognized,  to 
the  location  of  the  desired  item.  \_2~~\ 

On  a  variable  word  length  computer  this  revision  would  demand 
an  approximate  50  percent  increase  in  storage  requirements.   Less 
flexible  machines  would  probably  require  a  100  percent  increase . 
Even  in  this  latter  case,  the  proposed  method  is  superior  as  we 
shall  now  show. 

Consider  a  filial  set  of  size  m.   If  the  ith  member  of  the 
set  is  selected  as  the  starting  point  in  the  search,  the  original 
set  is  partitioned  into  three  subsets  with  1,  m  -  ns  and  n  -  1 
members.   The  average  search  times  for  these  sets  are  1,  1  +  T^ 


m-n 


and  1  +  T  ,      respectively 0   The  average  search  time  is  therefore 
n-i 


T 

m 


=  l[  1+(1  +  T    )  (m-n  )  +  (1  +  T   ,  )  Cn-l)"I 
L  m-n  n-1        J 

TOT    )  (m-n)  +  (T   )  (n-l)l 
m-n  n-1       J 


m 
Because  of  symmetry,  this  is  equivalent  to 


Z  (T- 


\'1  +  h        (T„-1  > (n- 

m 
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If  we  define  i  =  n  -  1,  we  have 

which  is  equivalent  to  A~ 

yn-i 

'm  wf  I- 


Obviously,  T.  ■  1. 


Now  that  the  expected  search  time  is  known,  relative  cost  can 
be  computed  on  the  same  basis  as  used  previously  (equation  (7)) 
for  direct  comparison.   See  Table  6  for  a  comparison  of  average 
search  time,  relative  search  time,  and  relative  cost  vs.  the  size 
of  the  filial  sets.   With  a  50  percent  increase  in  storage,  require- 
ments assumed  in  computing  relative  cost,  the  proposed  method  is 
superior  when  the  average  filial  set  size  exceeds  9.   If  100  percent 
storage  increase  is  required,  the  break -even  point  is  16„   [2  J 

Another  variation  is  to  allow  unbalanced  trees.   This  type  of 
tree  may  be  utilized  when  it  is  not  possible  to  freely  manipulate 
keys  and  select  filial  sets  to  be  near  optimum  size.  The  path 
lengths  may  then  vary  within  the  tree,  and  the  tree  may  be  con- 
structed using  random  length  key  words.   In  the  case  where  the  main 
storage  is  a  disc,  the  proper  path  length  is  easily  determined 
because  each  leaf  must  govern  T  items  (where  T  is  the  number  of 
items  that  may  be  accommodated  on  one  track  of  the  disc  file).   Thus, 
if  a  particular  node  governs  T  or  fewer  items,  the  branching  along 
that  particular  path  may  be  stopped.   If,  however,  the  node  governs 
more  than  T  items,  the  branching  must  be  continued  for  at  least 
one  more  level. 
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TABU,    6 
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c  (  ost 
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2 
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3 
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.9575 
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4 

2.5 
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1.0043 

.8871 

1 .0328 
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5 

3.0 

2.479990 

1.0381 

.8581 
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3.5 
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1.0878 
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1.1722 
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.8375 
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8 

4.5 

3.115177 

1.2052 

.8343 

1.0023 

1.1031 

!) 

5.0 

3.280595 

1.2073 

.8330 

1 .0990 

1 . OS  1 2 

JO 

5.5 

3.443729 

1.3302 

.8329 

1 .1400 

1.0700 

12 

(i.5 

3.723022 

1.4508 

.8345 

1 . 2257 

1 .0532 

1-1 

7.5 

3.907032 

1 . 5827 

.8372 

1.3140 

1.0431 

10 

8.5 

4.184048 

1.7073 

.8-104 

1 . 1040 

1.0371 

IS 

9.5 

4.378500 

1.8304 

.8430 

l.-l  94S 

1.0334 

20 

10.5 

4.555252 

1.9520 

.8408 

1.5847 

1.0312 

25 

13.0 

4.937190 

2.2492 

.8542 

1.8070 

1.0294 

30 

15.5 

5.250304 

2.5380 

.8600 

2.0250 

1.0300 

35 

18.0 

5.530518 

2.8190 

.8003 

2.2380 

1.0317 

40 

20.5 

5.771009 

3.0949 

.8712 

2.4482 

1.0338 

45 

23.0 

5.985222 

3.3049 

.8750 

2.0542 

1 .0300 

50 

25.5 

0.178372 

3.0302 

.8795 

2.8570 

1.0*383 

55 

28.0 

6.354258 

3.8913 

.8830 

3.0508 

1.0405 

GO 

30.5 

0.515730 

4.1487 

.8802 

3 . 2540 

1.0427 
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Frequently  there  are  subtrees  which  are  also  chains,  i.  e.  the 
filial  set  of  each  node  in  the  subtree  consists  of  a  single  node. 
For  example,  when  using  English  language  words  as  keys  the  commonly 
occurring  suffixes  such  as  ing,  tion,  ain,  ed  and  ly  seldom  span 
more  than  one  item.   Such  chains  waste  both  storage  space  and  search 
time.   Indeed,  if  a  node  governs  only  a  few  items,  instead  of  con- 
tinuing the  tree  in  the  normal  manner  from  that  node--  entailing  the 
use  of  several  levels  to  reach  the  items--  the  items  should  be  assigned 
to  the  nodes  of  the  filial  set  of  that  node  and  no  additional  levels 
used.   By  doing  this,  fewer  nodes  are  traversed  to  reach  the  leaves, 
thereby  reducing  both  the  required  storage  space  and  the  search 
time.   A  variable  length  value  field  is  required  for  these  leaf 
nodes,  however,  because  all  the  key  elements  of  the  item  not  already 
assigned  in  the  partial  key  of  the  parent  node  must  be  assigned  the 
values  of  the  leaves. 

Thus,  in  this  case  it  is  of  interest  to  determine  at  which  nodes 
the  branching  should  be  terminated  to  achieve  a  more  efficient 
search  time  and  a  reduced  storage  requirement.   A  path  leading  to 
a  block  of  items  is  considered  to  have  its  optimum  length  if  the 
expected  search  time  for  items  in  the  block  is  increased  when  the 
path  length  is  changed.   The  computation  of  this  optimum  path  length 
is  based  on  the  number  of  items,  g,  which  a  node  x  governs  and  the 
number  of  nodes,  m,  in  its  filial  set.   The  nodes  of  the  filial 
set  of  x  each  govern  an  average  of  g/m  items.   If  the  branching  is 
discontinued  at  node  x,  the  expected  search  time  x  is 
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td  -  k  (g+  Do 

If,  however,  the  branching  is  continued  for  one  more  level,  the 
expected  search  time  from  x  is 

t  =  h   (m  +  g  )  +  1 . 

m 

Thus,  the  expected  search  time  from  x  will  decrease  if  branching 
is  continued  when  t .   ^  t   .  F  37  J 

Figure  15  displays  the  relation  between  t  and  t   ;  for  any 
node  with  its  (g,m)  lying  in  the  pie -shaped  region  to  the  right 
of  the  solid  curve,  the  search  time  can  be  decreased  by  continuing 
the  branching  from  that  node.   The  dashed  curves  indicate  the  rela- 
tive improvement  in  the  search  time  between  the  cases  of  continuing 
and  discontinuing  the  branching  from  the  node.   As  g  ^  m,  the 
criterion  as  to  whether  to  branch  or  not  given  in  Figure  15  may 
be  approximated  by  stating  that  the  branching  should  be  continued 
from  any  node  which  governs  more  than  six  items.  [_37j 

Given  a  situation  where  the  size  of  h  and  m  are  either  (  or  both  ) 
required  to  be  larger  than  optimum,  we  are  then  faced  with  a  tree 
which  is  to  some  degree  vacant „   An  analysis  of  the  manipulations 
and  searches  associated  with  such  a  tree  forces  one  to  rely  primarily 
on  the  mathematics  of  probability.   Scidmore  and  Weinberg  have 
developed  an  analysis  of  trees  from  this  viewpoint.   []33J    The 
analysis  includes  the  situations  1)  where  keywords  are  chosen 
randomly  from  those  with  lengths  of  one  through  nine  elements,   and 
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2)  where  all  keywords  are  the  same  length.   Values  of  m  =  10,  26 , 
and  50  were  used  to  simulate  decimal  numbers,  alphabetic  characters, 
and  alphanumeric  characters  respectively. 

In  the  case  where  the  keywords  were  considered  all  the  same 
length,  the  results  of  the  analysis  for  the  predicted  mean  search  S 
correspond  exactly  with  results  previously  derived  for  t  (recall 
that  in  our  previous  development  we  considered  t  as  a  direct  function 
of  the  number  of  item  times,  i.  e.  the  number  of  links  traversed, 
t  is  therefore  identical  to  S).   A  plot  of  the,  mean  search  vs.  the 
logarithm  of  the  fraction  of  the  tree  in  use  is  shown  in  Figure  16 
for  several  values  of  m  and  L  (L  here  is  equivalent  to  h) .   It.  is 
apparent  that  when  there  are  very  few  sequences  stored,  the  mean 
search  is  approximately  L  and  somewhat  independent  of  m.   However, 
as  the  number  of  stored  sequences  approaches  m   (same  as  m  ) , 
the  mean  search  approaches  %  L(m  +  1)  as  noted  before. 
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Figure  16 


8.    Comparison  of  modified  trees. 

A  useful  modification  to  the  more  straightforward  tree,  struc- 
tures already  described  is  the  Trie .    Fredkin  has  proposed  a 
scheme  in  which  the  heirs  are  chained  and  the.  sib  sets  are  arranged 
in  blocks.   T8J    The  basic  feature  is  that  whenever  a  node,  on 
level  i  requires  an  heir  on  level  i  +  1,  a  block  large  enough  to 
contain  an  entire  filial  set  is  established.   The  heir  address  in 
level  i  is  the.  address  of  the  first  logical  member  of  the  filial 
set  and  all  other  members  of  that  set  are  understood  to  exist  in 
sequence  in  the  succeeding  memory  locations.   Therefore,  each 
computer  word  corresponding  to  a  node  need  only  contain  the  heir 
address  since  the  node's  label  is  implicit  in  its  position  within 
the  sib  set.   The  advantage  in  machines  and  applications  where  word 
size  is  a  problem  is  obvious.   Also,  retrieval  from  a  trie  structure 
requires  one  indexing  manipulation  for  each  level  of  the.  tree,  not 
an  item-by-item  search  of  the  nodes  as  in  the  tree  structure. 
Thus,  the  expected  search  time  is  proportional  to  the  average  path 
length,  h,  rather  than  %h  (m  +  1).   This  speed  advantage  is  com- 
pensated by  requiring  more  storage  locations,  however. 

The  trie  structure  is  introduced  at  this  point  since  it 
represents  a  middle-ground  compromise  between  the  two  extremes  in 
tree  structures,  i.e.  the  balanced  tree  and  the  completely  elided 
unbalanced  tree.   It  therefore  provides  a  fulcrum  for  comparison 


The  word  trie  is  taken  from  the  word  retrieval 
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of  the  various  types.   See  Figure  17  for  a  pictorial  comparison 
between  these  three  basic,  types  under  a  storage  condition  where  the 
majority  of  the  balanced  tree  is  vacant. 

Let  d  be  the  average,  number  of  elements  utilized  in  each  filial 
set  where  d  is  a  subset  of  the  m  possible  elements  in  a  filial  set. 
Note  that  in  a  balanced  tree,  d  =  m.   Computing  the  number  of 
storage  locations  W,  we  find; 


m"* 


For  a  balanced 

tree 

X 

v   ■ — 

w  =  "> 

m 

A.-I 

For  a  trie 

m 


,2  >-l  dh  -  1 

W  =  m+dm  +  dm  +  . . .  +  d    m=m 


For  an  elided  unbalanced  tree 

•«.       .  h 

„-    T    d*    =   a  d    -  * 

d   -  1 


i-1 
Since  d  K.    m, 


dh  -    i      y        mh   -  1 

d  e         m  S        m 


ih  - 1     y  m  dh  -  i  y 

i    -  i      x       d    -    l     ^ 


m 


or 


elided    N       trie    \    balanced 
From  this  simple  comparison,  it  is  obvious  that  the  ratio  of 
d/m  is  a  convenient  index  to  the  preferred  tree  structure.   This 
index,  however,  is  sensitive  only  to  the.  storage  criteria  and 
neglects  the  equally  important  factor  of  search  time.   One  of  the 
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advantages  of  the  trie  is  a  faster  search  time  so  that  it  is  even 
more  attractive,  than  the  elided  tree  in  certain  instances. 

The  balanced  tree  finds  its  place  in  applications  such  as  the 
multi-list  structure  developed  at  the  University  of  Pennsylvania. 

Ll8j   The  tree  acts  as  a  translator  whose  inputs  are  coded 
information  words  and  whose  outputs  are  addresses  of  lists  on  which 
items  of  information  are  stored.   The  keys,,  assumed  to  enter  the. 
system  in  random  order,  are  inserted  in  the  tree  so  that  the  ipobo- 
tonic  order  of  key  values  is  preserved  and  the  balanced  growth  of 
the  tree  is  continuously  maintained.   The  leaves  of  the  tree  lead 
into  the  Multi -Association  Area  as  lists.   Each  list  is  a  string  of 
items  that  have  at  least  on  key  in  common. 

A  technique  which  is  roughly  equivalent  to  an  h-level  balanced 
tree  is  the  addressing  of  multidimensional  arrays  as  linear  arrays 
or  vectors.  This  technique  provides  a  means  for  partitioning  a 
file  into  subfiles  which  is  simpler  than  the  tree  structure  and 
also  permits  more  rapid  entry  into  a  subfile  than  the  tree.   L1^   L 1 2 J 
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9.   Fibonaccian  searching. 

The  standard  binary  search  of  an  ordered  list  for  a  given 
argument  consists  of  comparing  the  argument  with  the  middle  item  of 
the  list  in  order  to  isolate  the  item  desired  to  a  list  half  as  long 
as  the  original  list„   This  process  is  repeated  until  the  item  is 
found.   As  we  have  seen,  the  number  of  trials  required  is  of  order 
log-  n  for  a  table  of  n  entries.   If  it  is  desired  to  save  memory 
space,  the  addresses  for  the  successive  trials  are  computed,  but 
this  requires  time  consuming  division  or  multiplication  by  an  inverse 
in  machines  having  no  binary  shift.   A  table  of  the  powers   of  two 
is  stored  and  used  if  a  program  which  is  conservative  of  time  is 
desired. 

One  method  to  conserve  both  time  and  storage  space  utilizes  the 

Fibonacci  numbers  defined  by: 

u.  =  i,  i  <  2 

1 

u.  -  u.   +  u,  „      i  >  2 

l    l-l    i-2  , 

The  important  distinction  here  is  that  the  successive  increments  are 

found  by  subtraction.  [_6J 

If  at  some  point  in  the  process  the  item  has  been  isolated 

to  an  interval  of  size  u,   beginning  at  A,  compare  x  to  C,  .   . 

l  i-l  where 
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x  =  value  of  the  argument 


C.  .,   *  contents  of  A+ u.  , 
l-l  l-l 


If  x<C._1, 


the  item  is  now  isolated  to  an  interval  of 


size  u.  1  beginning  a  A,  hence  replace  i  by 
i-1  and  repeat  the  process. 


If  x  >C.  n  ,    the  item  is  now  isolated  to  an  interval  of 
'     l-l 

size      u.  -  u„  1  =  u     beginning  at 
l     i-l     ±-Z 

A  +  u,   ,  so  replace  i  by  i  -  2,  A  by  A+  u.  - 
and  repeat  the  process. 

It  is  easily  verified  by  induction  that 

u= '  '  - 

n     yp 
where       .    /  y.  'fip 

0s 


For  large  n, 


and 
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In   the  worst   case,    the   reduction  factor  R        (defined   as   the   ratio 
of   successive   intervals)    is    <p    ,    requiring 


In   the   first  case    (if  x  is   not   found) ,   R     is     <^>       ,   requiring 


yyi  r  =— *h 

V         3tJU4> 


Clearly,  the  expected  value  of  R  is  pR  +  qR   ,  where 
p  is  the  probability  of  R   and  q  is  the  probability  of  R 


giving 

J_ 

as  in  the  binary  search. 


n-(i)* +  &<*>'* 


Therefore,  for  searching  an  ordered  list  having  n  elements ,  the 
expected  searching  time  is  of  order  log   n  and  the  maximum  searching 
time  is  of  order  log  ,      n.  j_6J 
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APPENDIX 
CLOSED  FORMS  OF  SUMMATIONS 


v-w 

k*l    ••■■ 

(2) 

V^    k!   ,  o(ca-  a-  1)  ♦  a 
4i  ck             oa(o  -  I)2 

(3) 

V  ,      k       e^fea  -  a  -  1)  ♦  a 

k.           (.-«2 

All  of  those 

expressions  may  be  verified  by  induction.     For 

example,  in  (1)  let 

.:                      '•'•       '-•     .  . 

si-Z°k- 

kal 

Then  S,  *  c  and  S,  *  S.   ,  -  o    must  be  satisfied. 

(2)  may  be  derived  from  (1)  by  the  following  manlpulationi 

ib     i  b     i  b 

a+l  a*l 

■  V  k      V  k-1      a*l 

4^   i  **""*** 

8*1 

If?     b*71 

(3)  may  be  derived  from  (2)  directly. 


